Chapter III Model Building and Evaluation - Evaluation
In the previous section on modeling, we learned how to build a model with the sklearn library and how to split a data set. But how do we know whether a model is actually useful, and can we trust its results? This section on evaluation addresses those questions.
Load the following libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
plt.rcParams['font.sans-serif'] = ['SimHei']  # Display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # Display the minus sign correctly
plt.rcParams['figure.figsize'] = (10, 6)  # Set the output figure size
Task: Load data and split test set and training set
# Load the cleaned feature data
X = pd.read_csv("clear_data.csv")
X.head()
| | PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 1 | 0 | 1 | 0 | 0 |
| 2 | 2 | 3 | 26.0 | 0 | 0 | 7.9250 | 1 | 0 | 0 | 0 | 1 |
| 3 | 3 | 1 | 35.0 | 1 | 0 | 53.1000 | 1 | 0 | 0 | 0 | 1 |
| 4 | 4 | 3 | 35.0 | 0 | 0 | 8.0500 | 0 | 1 | 0 | 0 | 1 |
# The target column comes from the original training file
y = pd.read_csv("train.csv")["Survived"]
y.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64
from sklearn.model_selection import train_test_split
# Stratified split so both sets keep the same survival ratio (default test_size is 0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.info(),y_train
<class 'pandas.core.frame.DataFrame'>
Int64Index: 668 entries, 671 to 80
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  668 non-null    int64
 1   Pclass       668 non-null    int64
 2   Age          668 non-null    float64
 3   SibSp        668 non-null    int64
 4   Parch        668 non-null    int64
 5   Fare         668 non-null    float64
 6   Sex_female   668 non-null    int64
 7   Sex_male     668 non-null    int64
 8   Embarked_C   668 non-null    int64
 9   Embarked_Q   668 non-null    int64
 10  Embarked_S   668 non-null    int64
dtypes: float64(2), int64(9)
memory usage: 82.6 KB
(None,
 671    0
 417    1
 634    0
 323    1
 379    0
       ..
 131    0
 490    0
 528    0
 48     0
 80     0
 Name: Survived, Length: 668, dtype: int64)
X_train.dtypes,y_train.dtypes
(PassengerId      int64
 Pclass           int64
 Age            float64
 SibSp            int64
 Parch            int64
 Fare           float64
 Sex_female       int64
 Sex_male         int64
 Embarked_C       int64
 Embarked_Q       int64
 Embarked_S       int64
 dtype: object,
 dtype('int64'))
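The `stratify=y` argument in the split above asks `train_test_split` to keep the class proportions of `y` the same in the training and test sets. A minimal sketch on made-up labels (the 60/40 toy data and the `_toy` names are assumptions for illustration, not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 samples, 40% of them positive
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# The default test_size is 0.25, so 75 training and 25 test samples
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=0
)

train_pos_rate = y_tr.mean()
test_pos_rate = y_te.mean()
print(len(y_tr), len(y_te))
print(train_pos_rate, test_pos_rate)
```

Both printed rates should match the 40% positive rate of the full toy data. The same default `test_size` explains why 668 of the 891 Titanic rows end up in `X_train`.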
# solver="liblinear" is an assumed workaround for the error described in the note below
lr = LogisticRegression(solver="liblinear")
lr.fit(X_train, y_train)

Note: in the original run, LogisticRegression() with default settings failed inside lr.fit with

AttributeError: 'str' object has no attribute 'decode'

The traceback ends in sklearn\utils\optimize.py (_check_optimize_result), at the call result.message.decode("latin1"), which runs while scikit-learn builds the solver's convergence warning. This points to a version mismatch between the installed scikit-learn and SciPy rather than a problem with the data. Two possible workarounds are to upgrade scikit-learn or to use a solver that should not go through this code path, such as liblinear above.
# Random forest classifier with 100 trees and a maximum depth of 5
rfc = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=5, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
# Accuracy on the training set and on the test set
print("Training set score:{:.2f}".format(rfc.score(X_train, y_train)))
print("Testing set score:{:.2f}".format(rfc.score(X_test, y_test)))

Training set score:0.86
Testing set score:0.81
Model evaluation
- Model evaluation measures the generalization ability of the model.
- Cross validation is a statistical method for evaluating generalization performance; it is more stable and comprehensive than a single split into training and test sets.
- In cross validation, the data is split multiple times and a model is trained for each split.
- The most commonly used form is k-fold cross validation, where k is a number specified by the user, usually 5 or 10.
- Precision measures how many of the samples predicted as positive are actually positive.
- Recall measures how many of the actual positive samples are predicted as positive.
- The f-score is the harmonic mean of precision and recall.
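In terms of the counts of true positives (TP), false positives (FP) and false negatives (FN), the last three metrics are commonly written as follows. The formula shown for the f-score is the usual F1 variant, which weights precision and recall equally:

```latex
\[
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\]
```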
[Thinking] Summarize the concepts above in your own words to consolidate your understanding.
Task 1: Cross validation
- Use 10 fold cross validation to evaluate the previous logistic regression model
- Calculate the average value of cross validation accuracy
#Tip: Cross validation
Image('Snipaste_2020-01-05_16-37-56.png')
Tip 4
- The module of cross validation in sklearn is sklearn.model_selection
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html?highlight=cross_val_score#sklearn.model_selection.cross_val_score
from sklearn.model_selection import cross_val_score

# Evaluate the random forest with 10-fold cross validation
rfc = RandomForestClassifier(n_estimators=100, max_depth=5)
# cv sets the cross validation splitting strategy; cv=None falls back to the default 5-fold cross validation
scores = cross_val_score(rfc, X_train, y_train, cv=10)

# k-fold cross validation scores
scores
array([0.85074627, 0.79104478, 0.85074627, 0.7761194 , 0.80597015, 0.85074627, 0.76119403, 0.82089552, 0.77272727, 0.77272727])
# Average cross validation score
print("Average cross-validation score: {:.2f}".format(scores.mean()))
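Beyond the mean, the spread of the fold scores gives a rough idea of how stable the estimate is. A small sketch using the ten scores printed above, copied by hand (the name `fold_scores` is only for illustration):

```python
from statistics import mean, stdev

# The ten fold accuracies printed above, copied by hand
fold_scores = [0.85074627, 0.79104478, 0.85074627, 0.7761194, 0.80597015,
               0.85074627, 0.76119403, 0.82089552, 0.77272727, 0.77272727]

avg = mean(fold_scores)
spread = stdev(fold_scores)  # sample standard deviation across the folds
print("Average cross-validation score: {:.2f}".format(avg))
print("Standard deviation across folds: {:.2f}".format(spread))
```

With these numbers the average rounds to 0.81, in line with the single test-set score reported earlier.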
Thinking 4
- What is the effect of using more folds (a larger k)?
A: The more folds there are, the more data each model is trained on and the less data is left in each test fold.
A model trained on more data may fit better, and averaging over more train/test rounds tends to make the evaluation result more stable and trustworthy.
On the other hand, more folds mean more models to train. With a large data set each training run already takes a long time, so the total computational cost rises.
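To make the trade-off concrete, the sketch below computes how many rows land in each test fold for the 668 training rows used here. It assumes the convention documented for scikit-learn's `KFold`, where the first `n % k` folds get one extra sample; `fold_sizes` is a hypothetical helper, not a library function, and the stratified splitting that `cross_val_score` applies to classifiers may distribute rows slightly differently.

```python
def fold_sizes(n_samples, k):
    """Test-fold sizes for k-fold CV: the first n_samples % k folds get one extra sample."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]

n_train_rows = 668  # number of rows in X_train in this notebook

for k in (5, 10, 20):
    sizes = fold_sizes(n_train_rows, k)
    # Each model is trained on everything except its own test fold
    print("k =", k,
          "| test fold size:", min(sizes), "to", max(sizes),
          "| training rows per model:", n_train_rows - max(sizes), "to", n_train_rows - min(sizes),
          "| models to fit:", k)
```

For k = 10 this gives eight folds of 67 rows and two of 66, which agrees with the fractions behind the scores printed earlier (for example 57/67 ≈ 0.8507 and 51/66 ≈ 0.7727).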
Task 2: confusion matrix
- Calculate the confusion matrix of a binary classification problem
- Calculate precision, recall and f-score
[Thinking] What is the confusion matrix of a binary classification problem? Understand the concept and what it is mainly used to calculate.
A: I learned this in the previous [melon eating tutorial] https://blog.csdn.net/sinat_33209811/article/details/125755163
#Tip: confusion matrix
Image('Snipaste_2020-01-05_16-38-26.png')
#Tip: how accuracy, precision, recall and f-score are calculated
Image('Snipaste_2020-01-05_16-39-27.png')
Tip 5
- The confusion matrix function is in the sklearn.metrics module
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html?highlight=confusion_matrix#sklearn.metrics.confusion_matrix
- The confusion matrix takes the true labels and the predicted labels as input
- Precision, recall and f-score can be obtained with classification_report, also in sklearn.metrics
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report
from sklearn.metrics import confusion_matrix

rfc = RandomForestClassifier(n_estimators=100, max_depth=5)
rfc.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=5, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
# Predict on the training set
pred = rfc.predict(X_train)

# Confusion matrix: rows are true labels, columns are predicted labels
confusion_matrix(y_train, pred)
array([[391, 21], [ 64, 192]], dtype=int64)
from sklearn.metrics import classification_report
# Precision, recall and f1-score
print(classification_report(y_train, pred))

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.86 | 0.95 | 0.90 | 412 |
| 1 | 0.90 | 0.75 | 0.82 | 256 |
| accuracy | | | 0.87 | 668 |
| macro avg | 0.88 | 0.85 | 0.86 | 668 |
| weighted avg | 0.88 | 0.87 | 0.87 | 668 |
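The figures in the report can be reproduced by hand from the confusion matrix printed earlier. The sketch below assumes scikit-learn's layout (rows are true classes, columns are predicted classes, so for labels 0 and 1 the entries read TN, FP in the first row and FN, TP in the second) and treats class 1 (survived) as the positive class:

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 391, 21
fn, tp = 64, 192

precision = tp / (tp + fp)   # of the predicted survivors, how many really survived
recall = tp / (tp + fn)      # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("precision: {:.2f}".format(precision))
print("recall:    {:.2f}".format(recall))
print("f1-score:  {:.2f}".format(f1))
print("accuracy:  {:.2f}".format(accuracy))
```

The results round to 0.90, 0.75, 0.82 and 0.87, matching the row for class 1 and the accuracy line of the report. Note that these numbers describe the training set, so they are likely more optimistic than the same metrics on `X_test` would be.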
[Thinking]
- What should you pay attention to when computing the confusion matrix?
A: Reference https://blog.csdn.net/zhangxiaohua18/article/details/122311808
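One point that is easy to get wrong is the argument order: the true labels come first and the predicted labels second. A small pure-Python sketch of the same row/column convention shows that swapping them transposes the matrix, which exchanges false positives and false negatives (`confusion_counts` and the toy labels are hypothetical, written only for illustration):

```python
def confusion_counts(y_true, y_pred):
    """2x2 matrix [[TN, FP], [FN, TP]]: rows are true classes, columns are predicted classes."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Toy labels (hypothetical)
y_true_toy = [0, 0, 0, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

correct_order = confusion_counts(y_true_toy, y_pred_toy)
swapped_order = confusion_counts(y_pred_toy, y_true_toy)
print(correct_order)  # 2 false positives, 1 false negative
print(swapped_order)  # transposed: the two error types change places
```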
Task 3: ROC curve
- Draw ROC curve
[Thinking] What is the ROC curve, and what problem is it designed to solve?
A: I learned this in the previous [melon eating tutorial] https://blog.csdn.net/sinat_33209811/article/details/125755163
Tip 6
- The ROC curve function is in the sklearn.metrics module
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html?highlight=roc_curve#sklearn.metrics.roc_curve
- The larger the area under the ROC curve, the better
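To see where the curve and its area come from, here is a small from-scratch sketch on made-up scores. It lowers a threshold over the scores, records the (FPR, TPR) pair at each step, and integrates with the trapezoid rule. `roc_points` and `auc_trapezoid` are hypothetical helpers written for illustration, not sklearn functions.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold one distinct score at a time."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    fpr, tpr = [0.0], [0.0]  # threshold above every score: nothing is predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr

def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy labels and scores (hypothetical)
y_toy = [0, 0, 1, 1]
s_toy = [0.1, 0.4, 0.35, 0.8]

fpr_toy, tpr_toy = roc_points(y_toy, s_toy)
area = auc_trapezoid(fpr_toy, tpr_toy)
print(list(zip(fpr_toy, tpr_toy)))
print("AUC:", area)
```

An area of 1.0 would mean every positive sample is scored above every negative one, while 0.5 corresponds to random ranking; the toy data lands in between at 0.75.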
from sklearn.metrics import roc_curve

# Make sure lr is fitted before calling decision_function: the default-solver fit failed in the original run.
# solver="liblinear" is an assumed workaround for that failure.
lr = LogisticRegression(solver="liblinear")
lr.fit(X_train, y_train)

fpr, tpr, thresholds = roc_curve(y_test, lr.decision_function(X_test))
plt.plot(fpr, tpr, label="ROC Curve")
plt.xlabel("FPR")
plt.ylabel("TPR (recall)")
# Find the threshold closest to 0, the default decision boundary of decision_function
close_zero = np.argmin(np.abs(thresholds))
plt.plot(fpr[close_zero], tpr[close_zero], 'o', markersize=10, label="threshold zero", fillstyle="none", c='k', mew=2)
plt.legend(loc=4)

Note: in the original run this cell raised AttributeError: 'list' object has no attribute 'shape' inside lr.decision_function, most likely because lr had never been fitted successfully, so no plot was produced.
Thinking 6
- How can an ROC curve be drawn for a multi-class classification problem?
[Thinking] What information can you get from this ROC curve? What can this information be used for?
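On the multi-class question above: one common approach is one-vs-rest, where each class in turn is treated as the positive class and all others as negative, giving one ROC curve per class. The sketch below is only an illustration under assumptions: it uses the iris data set and a logistic regression with `max_iter=1000`, neither of which comes from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Three-class example data (an assumption; the Titanic task itself is binary)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, stratify=y_iris, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)                   # one probability column per class
y_bin = label_binarize(y_te, classes=[0, 1, 2])   # one 0/1 indicator column per class

curves = {}
for c in range(3):
    # Class c against the rest: indicator column as labels, class-c probability as score
    fpr_c, tpr_c, _ = roc_curve(y_bin[:, c], proba[:, c])
    curves[c] = (fpr_c, tpr_c, auc(fpr_c, tpr_c))
    print("class", c, "one-vs-rest AUC: {:.3f}".format(curves[c][2]))
```

Each `(fpr_c, tpr_c)` pair can then be passed to `plt.plot` to draw the three curves on the same axes.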