# [DW team learning - hands-on data analysis] Chapter 3: Model building and evaluation - evaluation

## Chapter III Model Building and Evaluation - Evaluation

In the previous section we used the sklearn library to build models and split the data set into training and test sets. But how do we know whether a model is actually useful, and can we trust its results? This section on model evaluation addresses those questions.

```import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
```
```%matplotlib inline
```
```plt.rcParams['font.sans-serif'] = ['SimHei']  # Used to display Chinese labels normally
plt.rcParams['axes.unicode_minus'] = False  # Used to display the minus sign normally
plt.rcParams['figure.figsize'] = (10, 6)  # Set Output Picture Size
```

```#Write Code
```
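|   | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 |
| 4 | 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |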
```y = pd.read_csv("train.csv")["Survived"]
```
```0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
```
```from sklearn.model_selection import train_test_split
```
```#Write Code
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 0)
```
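The `stratify = y` argument in the split above keeps the class proportions the same in both parts, and with no `test_size` given, `train_test_split` holds out 25% of the rows by default. A minimal sketch of that behaviour on made-up labels (the toy arrays `X_toy` and `y_toy` are assumptions for illustration, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, 60 of class 0 and 40 of class 1
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# Same call pattern as in the tutorial: stratified, default 25% test share
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, stratify=y_toy, random_state=0)

print("train size:", len(y_tr), "positive share:", y_tr.mean())
print("test size: ", len(y_te), "positive share:", y_te.mean())
```

Both parts should show the same share of positive labels as the full toy set, which is the point of stratifying when the classes are imbalanced.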
```X_train.info(),y_train
```
```<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 671 to 80
Data columns (total 11 columns):
#   Column       Non-Null Count  Dtype
---  ------       --------------  -----
0   PassengerId  668 non-null    int64
1   Pclass       668 non-null    int64
2   Age          668 non-null    float64
3   SibSp        668 non-null    int64
4   Parch        668 non-null    int64
5   Fare         668 non-null    float64
6   Sex_female   668 non-null    int64
7   Sex_male     668 non-null    int64
8   Embarked_C   668 non-null    int64
9   Embarked_Q   668 non-null    int64
10  Embarked_S   668 non-null    int64
dtypes: float64(2), int64(9)
memory usage: 82.6 KB

(None,
671    0
417    1
634    0
323    1
379    0
..
131    0
490    0
528    0
48     0
80     0
Name: Survived, Length: 668, dtype: int64)
```
```X_train.dtypes,y_train.dtypes
```
```(PassengerId      int64
Pclass           int64
Age            float64
SibSp            int64
Parch            int64
Fare           float64
Sex_female       int64
Sex_male         int64
Embarked_C       int64
Embarked_Q       int64
Embarked_S       int64
dtype: object,
dtype('int64'))
```
```lr = LogisticRegression()
lr.fit(X_train, y_train)
```
```---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_8652/2079915614.py in <module>
1 lr = LogisticRegression()
----> 2 lr.fit(X_train, y_train)

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\sklearn\linear_model\_logistic.py in fit(self, X, y, sample_weight)
1599                       penalty=penalty, max_squared_sum=max_squared_sum,
1600                       sample_weight=sample_weight)
-> 1601             for class_, warm_start_coef_ in zip(classes_, warm_start_coef))
1602
1603         fold_coefs_, _, n_iter_ = zip(*fold_coefs_)

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\parallel.py in __call__(self, iterable)
1027             # remaining jobs.
1028             self._iterating = False
-> 1029             if self.dispatch_one_batch(iterator):
1030                 self._iterating = self._original_iterator is not None
1031

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\parallel.py in dispatch_one_batch(self, iterator)
845                 return False
846             else:
848                 return True
849

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\parallel.py in _dispatch(self, batch)
763         with self._lock:
764             job_idx = len(self._jobs)
--> 765             job = self._backend.apply_async(batch, callback=cb)
766             # A job can complete so quickly than its callback is
767             # called before we get here, causing self._jobs to

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\_parallel_backends.py in apply_async(self, func, callback)
206     def apply_async(self, func, callback=None):
207         """Schedule a func to be run"""
--> 208         result = ImmediateResult(func)
209         if callback:
210             callback(result)

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\_parallel_backends.py in __init__(self, batch)
570         # Don't delay the application, to avoid keeping the input
571         # arguments in memory
--> 572         self.results = batch()
573
574     def get(self):

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\parallel.py in __call__(self)
251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
252             return [func(*args, **kwargs)
--> 253                     for func, args, kwargs in self.items]
254
255     def __reduce__(self):

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\joblib\parallel.py in <listcomp>(.0)
251         with parallel_backend(self._backend, n_jobs=self._n_jobs):
252             return [func(*args, **kwargs)
--> 253                     for func, args, kwargs in self.items]
254
255     def __reduce__(self):

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\sklearn\linear_model\_logistic.py in _logistic_regression_path(X, y, pos_class, Cs, fit_intercept, max_iter, tol, verbose, solver, coef, class_weight, dual, penalty, intercept_scaling, multi_class, random_state, check_input, max_squared_sum, sample_weight, l1_ratio)
938             n_iter_i = _check_optimize_result(
939                 solver, opt_res, max_iter,
--> 940                 extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)
941             w0, loss = opt_res.x, opt_res.fun
942         elif solver == 'newton-cg':

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\sklearn\utils\optimize.py in _check_optimize_result(solver, result, max_iter, extra_warning_msg)
241                 "    https://scikit-learn.org/stable/modules/"
242                 "preprocessing.html"
--> 243             ).format(solver, result.status, result.message.decode("latin1"))
244             if extra_warning_msg is not None:
245                 warning_msg += "\n" + extra_warning_msg

AttributeError: 'str' object has no attribute 'decode'
```
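The `AttributeError` above is raised inside scikit-learn's convergence-warning code, not by the data itself. This message is commonly reported when an older scikit-learn release runs against a newer SciPy, so upgrading scikit-learn is probably the cleanest fix. As a workaround sketch, scaling the features and raising `max_iter` usually lets the default solver converge, so the warning path is never reached. The data below comes from `make_classification` as a stand-in (an assumption; replace it with the Titanic `X` and `y`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 11 features, like the Titanic feature table
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

# Standardize the features, then fit logistic regression with more iterations
lr_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_demo.fit(X_tr, y_tr)

print("Training set score: {:.2f}".format(lr_demo.score(X_tr, y_tr)))
print("Testing set score: {:.2f}".format(lr_demo.score(X_te, y_te)))
```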
```#Write Code
# Random forest classifier with 100 trees and a maximum depth of 5
rfc = RandomForestClassifier(n_estimators=100, max_depth = 5)
rfc.fit(X_train, y_train)
```
```RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
```
```#Write Code
print("Training set score:{:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score:{:.2f}".format(rfc.score(X_test, y_test)))

```
```Training set score:0.86
Testing set score:0.81
```

### Model evaluation

• Model evaluation tells us how well the model generalizes.
• Cross validation is a statistical method for evaluating generalization performance; it is more stable and thorough than a single split into training and test sets.
• In cross validation the data is split several times, and one model is trained for each split.
• The most common variant is k-fold cross validation, where k is a number chosen by the user, usually 5 or 10.
• Precision measures how many of the samples predicted as positive are actually positive.
• Recall measures how many of the actual positive samples are predicted as positive.
• The f-score is the harmonic mean of precision and recall.
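To make the k-fold idea concrete, here is a hand-rolled sketch of how the row indices could be partitioned. The helper `k_fold_indices` is hypothetical and only illustrates the mechanics; it does not shuffle or stratify, whereas sklearn's `cross_val_score` uses stratified folds for classifiers when `cv` is an integer.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 10 samples split into 5 folds: each sample is validated exactly once
splits = list(k_fold_indices(10, 5))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Each of the k models is trained on the `train` indices and scored on the `validate` indices, and the k scores are then averaged.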

[Thinking]: To further understand the above concepts, you can make a summary

• Use 10-fold cross validation to evaluate the previous logistic regression model
• Calculate the mean cross validation accuracy
```#Tip: Cross validation
Image('Snipaste_2020-01-05_16-37-56.png')
```

#### Tip 4

• The cross validation functions in sklearn are in the sklearn.model_selection module

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross_val_score#sklearn.model_selection.cross_val_score

```#Write Code
from sklearn.model_selection import cross_val_score
```
```#Write Code
rfc = RandomForestClassifier(n_estimators=100, max_depth = 5)
scores = cross_val_score(rfc, X_train, y_train, cv=10)  # cv sets the cross validation splitting strategy; cv=None uses the default 5-fold cross validation
```
```# k-fold cross validation score
scores
```
```array([0.85074627, 0.79104478, 0.85074627, 0.7761194 , 0.80597015,
0.85074627, 0.76119403, 0.82089552, 0.77272727, 0.77272727])
```
```# Average Cross Validation Score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
```

#### Thinking 4

• What is the effect of using more folds (a larger k)?
A: The more folds there are, the more data each training run uses and the less data is left in each validation fold.
With more training data each model may fit better, and averaging over more runs makes the evaluation result more stable and reliable.
On the other hand, more folds means more training runs, and with a large data set each run also takes longer, so the overall computational cost rises.
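The trade-off can be checked with simple arithmetic. The sketch below uses a hypothetical helper, `fold_sizes`, to list how many rows go to training and validation for several values of k, using the 668 training rows reported earlier:

```python
def fold_sizes(n_samples, k):
    """Return the range of validation and training sizes for k-fold CV."""
    base, extra = divmod(n_samples, k)
    val_small = base
    val_large = base + (1 if extra else 0)
    return {
        "k": k,
        "model_fits": k,  # one model is trained per fold
        "val_size": (val_small, val_large),
        "train_size": (n_samples - val_large, n_samples - val_small),
    }

n_train = 668  # rows in X_train
summaries = [fold_sizes(n_train, k) for k in (2, 5, 10)]
for s in summaries:
    print(s)
```

Going from k=2 to k=10 raises the training share of each run from half of the rows to roughly nine tenths, at the price of five times as many model fits.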

• Calculate the confusion matrix of the binary classification problem
• Calculate precision, recall and f-score

[Thinking] What is the confusion matrix of a binary classification problem? Understand the concept and know what kind of task it is mainly used for
A: I covered this earlier in the [melon eating tutorial] https://blog.csdn.net/sinat_33209811/article/details/125755163

```#Tip: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')
```
```#Tip: how Accuracy, Precision, Recall and f-score are calculated
Image('Snipaste_2020-01-05_16-39-27.png')
```

#### Tip 5

• The confusion matrix function is in the sklearn.metrics module
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html?highlight=confusion_matrix#sklearn.metrics.confusion_matrix
• The confusion matrix takes the true labels and the predicted labels as input
• Precision, recall and f-score can be obtained with the classification_report function
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report
```#Write Code
from sklearn.metrics import confusion_matrix
```
```#Write Code
rfc = RandomForestClassifier(n_estimators=100, max_depth = 5)
rfc.fit(X_train, y_train)
```
```RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
criterion='gini', max_depth=5, max_features='auto',
max_leaf_nodes=None, max_samples=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
```
```#Write Code
pred = rfc.predict(X_train)

```
```#Confusion matrix
confusion_matrix(y_train, pred)

```
```array([[391,  21],
[ 64, 192]], dtype=int64)
```
```from sklearn.metrics import classification_report
```
```# Precision, recall and f1 score
print(classification_report(y_train, pred))
```
```              precision    recall  f1-score   support

           0       0.86      0.95      0.90       412
           1       0.90      0.75      0.82       256

    accuracy                           0.87       668
   macro avg       0.88      0.85      0.86       668
weighted avg       0.88      0.87      0.87       668
```
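The report rows for class 1 can be reproduced by hand from the confusion matrix printed above. In sklearn's layout the rows are true labels and the columns are predicted labels, so for labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. A short sketch of the arithmetic:

```python
# Counts taken from the confusion matrix output above
tn, fp = 391, 21
fn, tp = 64, 192

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of the predicted positives, how many are real
recall = tp / (tp + fn)     # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print("accuracy:  {:.2f}".format(accuracy))
print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
```

Note that these figures come from predictions on the training set, so they are likely to be more optimistic than the same metrics computed on `X_test`.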

[Thinking]

• What should I pay attention to when implementing the confusion matrix?
A: See https://blog.csdn.net/zhangxiaohua18/article/details/122311808

• Draw the ROC curve

[Thinking] What is the ROC curve, and what problem does it exist to solve?
A: I covered this earlier in the [melon eating tutorial] https://blog.csdn.net/sinat_33209811/article/details/125755163

#### Tip 6

• The ROC curve function in sklearn is in the sklearn.metrics module
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html?highlight=roc_curve#sklearn.metrics.roc_curve
• The larger the area enclosed under the ROC curve, the better
```#Write Code
from sklearn.metrics import roc_curve
```
```#Write Code
fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)
```
```---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

~\AppData\Local\Temp/ipykernel_8652/1348101448.py in <module>
1 #Write Code
----> 2 fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
3 plt.plot(fpr, tpr, label="ROC Curve")
4 plt.xlabel("FPR")
5 plt.ylabel("TPR (recall)")

C:\ProgramData\Anaconda3\envs\py37\lib\site-packages\sklearn\linear_model\_base.py in decision_function(self, X)
268         X = check_array(X, accept_sparse='csr')
269
--> 270         n_features = self.coef_.shape
271         if X.shape != n_features:
272             raise ValueError("X has %d features per sample; expecting %d"

AttributeError: 'list' object has no attribute 'shape'
```
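This second error most likely follows from the first one: `lr` never finished fitting, so `decision_function` has no usable coefficients. One way around it is to draw the ROC curve from a model that did fit, using the predicted probability of the positive class as the score. The sketch below does this with a random forest on synthetic stand-in data (the `make_classification` data and the `*_demo` names are assumptions; substitute the Titanic split and the fitted `rfc`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with 11 features
X_demo, y_demo = make_classification(n_samples=500, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=0)

rfc_demo = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
rfc_demo.fit(X_tr, y_tr)

# Probability of the positive class serves as the ranking score
pos_scores = rfc_demo.predict_proba(X_te)[:, 1]
fpr, tpr, thresholds = roc_curve(y_te, pos_scores)
auc = roc_auc_score(y_te, pos_scores)

# With probabilities, the default decision threshold is 0.5 rather than 0
close_half = np.argmin(np.abs(thresholds - 0.5))
print("AUC: {:.3f}".format(auc))
print("Point near threshold 0.5: FPR={:.2f}, TPR={:.2f}".format(fpr[close_half], tpr[close_half]))
```

The `fpr` and `tpr` arrays can be passed to `plt.plot(fpr, tpr)` in the same way as in the cell above.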

#### Thinking 6

• How do you draw the ROC curve for a multi-class classification problem?
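One common answer is the one-vs-rest approach: treat each class in turn as the positive class, binarize the labels, and draw one ROC curve per class. A minimal sketch, assuming the iris data set bundled with sklearn as a three-class stand-in (the Titanic task itself is binary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, stratify=y_iris, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # one probability column per class

# One indicator column per class: 1 where the sample belongs to that class
y_te_bin = label_binarize(y_te, classes=clf.classes_)

curves = {}
for i, cls in enumerate(clf.classes_):
    fpr, tpr, _ = roc_curve(y_te_bin[:, i], proba[:, i])
    curves[cls] = (fpr, tpr)
    print("class {} vs rest AUC: {:.3f}".format(cls, roc_auc_score(y_te_bin[:, i], proba[:, i])))

# Macro-averaged one-vs-rest AUC over all classes
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr")
print("macro OvR AUC: {:.3f}".format(macro_auc))
```

Each entry in `curves` can be plotted with `plt.plot(fpr, tpr, label=...)` to overlay the per-class curves on one figure.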

[Thinking] What information can you get from this ROC curve? What can this information be used for?

Tags: Python Data Analysis

Posted by vocoder on Fri, 23 Sep 2022 02:21:35 +0930