# Random forest

## Summary

### Overview of ensemble algorithms

Ensemble learning is a machine learning paradigm that has attracted much attention. It is not a single algorithm: it builds multiple models on the data and integrates the results of all of them.

| Goal of ensemble algorithms |
| --- |
| An ensemble algorithm considers the results of multiple models and aggregates them into a comprehensive result, so as to achieve better classification or regression performance than any single model. |

Generally speaking, there are three families of ensemble algorithms: bagging, boosting and stacking.

The bagging method builds multiple independent estimators, and then averages their predictions (regression) or takes a majority vote (classification) to determine the ensemble result. The representative model of the bagging method is the random forest.

In the boosting method, the base estimators are correlated and built one by one, in order. Its core idea is to combine the power of weak estimators by repeatedly fitting the samples that are hard to predict, so as to form a strong estimator. Representative boosting models include AdaBoost and gradient boosting trees.
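As a quick illustration of the boosting idea, the sketch below compares a single weak learner (a depth-1 tree) with an AdaBoost ensemble of such trees. The choice of the wine dataset and of 50 estimators is mine, for illustration only:

```python
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()

# A single weak learner: a decision "stump" of depth 1
stump = DecisionTreeClassifier(max_depth=1, random_state=0)

# AdaBoost's default base learner is also a depth-1 tree
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)

stump_score = cross_val_score(stump, wine.data, wine.target, cv=5).mean()
boost_score = cross_val_score(boosted, wine.data, wine.target, cv=5).mean()
print(f"single stump: {stump_score:.3f}  AdaBoost of 50 stumps: {boost_score:.3f}")
```

The ensemble of weak learners substantially outperforms any one of them, which is exactly the claim above.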

### Ensemble algorithms in sklearn

| Class | Function of class |
| --- | --- |
| ensemble.AdaBoostClassifier | AdaBoost classification |
| ensemble.AdaBoostRegressor | AdaBoost regression |
| ensemble.BaggingClassifier | Bagging classification |
| ensemble.BaggingRegressor | Bagging regression |
| ensemble.ExtraTreesClassifier | Extra-trees classification (extremely randomized trees) |
| ensemble.ExtraTreesRegressor | Extra-trees regression |
| ensemble.GradientBoostingClassifier | Gradient boosting classification |
| ensemble.GradientBoostingRegressor | Gradient boosting regression |
| ensemble.IsolationForest | Isolation forest |
| ensemble.RandomForestClassifier | Random forest classification |
| ensemble.RandomForestRegressor | Random forest regression |
| ensemble.RandomTreesEmbedding | Ensemble of completely random trees |
| ensemble.VotingClassifier | Soft voting / majority rule classifier for unfitted estimators |

## RandomForestClassifier

| Parameter | Meaning |
| --- | --- |
| criterion | The impurity measure; two options: Gini impurity ("gini") and information entropy ("entropy") |
| max_depth | The maximum depth of the tree; branches beyond the maximum depth are pruned |
| min_samples_leaf | After a split, each child node must contain at least min_samples_leaf training samples, otherwise the split does not happen |
| min_samples_split | A node must contain at least min_samples_split training samples to be allowed to split, otherwise the split does not happen |
| max_features | Limits the number of features considered at each split; features beyond the limit are discarded. The default is the square root of the total number of features |
| min_impurity_decrease | Limits the information gain; splits with an information gain smaller than this value do not happen |
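Putting the table together, here is a sketch of instantiating a forest with these pruning parameters. The specific values are illustrative only, not recommendations; in practice they should be tuned with cross-validation:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; tune with cross-validation in practice
rfc = RandomForestClassifier(
    n_estimators=100,
    criterion="gini",           # or "entropy"
    max_depth=5,                # branches deeper than this are pruned
    min_samples_leaf=2,         # each child of a split must keep >= 2 samples
    min_samples_split=4,        # a node needs >= 4 samples before it may split
    max_features="sqrt",        # consider sqrt(n_features) features per split
    min_impurity_decrease=0.0,  # minimum impurity gain required to split
    random_state=0,
)
print(rfc.get_params()["max_depth"])
```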

### n_estimators

This is the number of trees in the forest, i.e. the number of base estimators. Its effect on the accuracy of the random forest is generally monotonic: accuracy rises as n_estimators grows, until it plateaus (while the computational cost keeps increasing).
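This saturating behaviour can be checked with a simple learning curve. The sketch below (the wine dataset and the range of values are my choices) just collects cross-validated scores rather than plotting them:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

wine = load_wine()
scores = []
for n in range(1, 51, 10):  # forests of 1, 11, 21, 31, 41 trees
    rfc = RandomForestClassifier(n_estimators=n, random_state=0)
    scores.append(cross_val_score(rfc, wine.data, wine.target, cv=5).mean())
print(scores)
```

Accuracy typically rises quickly for the first few trees and then levels off.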

- Build a forest

Import the packages:

```python
%matplotlib inline
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
```

The basic modeling process:

- Instantiate the model
- Train the model and tune its parameters
- Call the interfaces (e.g. score, predict) to obtain results

```python
from sklearn.model_selection import train_test_split

# Split into training and test sets
wine = load_wine()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)

# Instantiate
clf = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)

# Train the models
clf = clf.fit(Xtrain, Ytrain)
rfc = rfc.fit(Xtrain, Ytrain)

# Check model accuracy
score_c = clf.score(Xtest, Ytest)
score_r = rfc.score(Xtest, Ytest)
print("Single Tree: {}".format(score_c), "Random Forest: {}".format(score_r))
```

Compare the performance of the random forest and the decision tree over ten runs of 10-fold cross-validation:

```python
from sklearn.model_selection import cross_val_score

rfc_l = []
clf_l = []
for i in range(10):
    rfc = RandomForestClassifier(n_estimators=25)
    rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    rfc_l.append(rfc_s)
    clf = DecisionTreeClassifier()
    clf_s = cross_val_score(clf, wine.data, wine.target, cv=10).mean()
    clf_l.append(clf_s)

plt.plot(range(1, 11), rfc_l, label="Random Forest")
plt.plot(range(1, 11), clf_l, label="Decision Tree")
plt.legend()
plt.show()
```

### random_state

Random forests also have a random_state parameter, and its usage is similar to that in the classification tree. The difference is that in a classification tree random_state controls the generation of a single tree, whereas in a random forest it controls the way the whole forest is generated, rather than there being only one tree in the forest.

```python
rfc = RandomForestClassifier(n_estimators=20, random_state=2)
rfc = rfc.fit(Xtrain, Ytrain)

# An important attribute of random forests: estimators_, which lists the trees in the forest
rfc.estimators_[0].random_state
for i in range(len(rfc.estimators_)):
    print(rfc.estimators_[i].random_state)
```

### bootstrap & oob_score

To make the base classifiers as different as possible, an intuitive approach is to train them on different training sets. The bagging method creates these different training sets through random sampling with replacement, and the bootstrap parameter controls this sampling technique.

From an original training set containing n samples, we sample randomly, one sample at a time, putting the sample back before the next draw, so the same sample may be drawn again. Doing this n times yields a bootstrap sample of the same size n as the original training set. Because of the random sampling, each bootstrap sample differs from the original data set and from the other bootstrap samples. In this way we can freely generate an inexhaustible supply of different bootstrap samples, and training our base classifiers on them naturally makes the classifiers different.
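Sampling with replacement also means some original samples never make it into a given bootstrap sample: the probability that a specific sample is never drawn is (1 - 1/n)^n, which tends to 1/e ≈ 0.368 as n grows. So each tree sees roughly 63.2% of the original samples, and the remaining "out-of-bag" samples are what oob_score evaluates on. A small simulation (assuming numpy is available; n and the trial count are arbitrary) confirms this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000      # size of the original training set
trials = 200  # number of bootstrap samples to simulate

# For each bootstrap sample, measure the fraction of distinct original samples drawn
frac_unique = np.mean([
    len(np.unique(rng.integers(0, n, size=n))) / n
    for _ in range(trials)
])
print(f"fraction of original samples drawn: {frac_unique:.3f}")
print(f"theoretical limit 1 - 1/e:          {1 - np.exp(-1):.3f}")
```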

```python
# No need to split into training and test sets
rfc = RandomForestClassifier(n_estimators=25, oob_score=True)
rfc = rfc.fit(wine.data, wine.target)

# Important attribute: oob_score_
rfc.oob_score_
```

### Summary

Four important parameters:

n_estimators, random_state, bootstrap and oob_score. These four parameters help us understand the basic process and key concepts of the bagging method.

Two important attributes:

estimators_ and oob_score_

In addition to these two attributes, random forest, as an ensemble of tree models, naturally also has the feature_importances_ attribute.

## RandomForestRegressor

All parameters, attributes and interfaces are consistent with the random forest classifier. The only difference is that regression trees differ from classification trees, so the impurity measures supported by the criterion parameter are different.

### criterion

For regression trees, criterion is the measure of split quality, and it supports three options:

- Enter "mse" to use the mean squared error (MSE). The reduction in MSE between the parent node and the leaf nodes is used as the criterion for feature selection. This method minimizes the L2 loss by using the mean of each leaf node
- Enter "friedman_mse" to use Friedman's mean squared error, an improved MSE proposed by Friedman for evaluating potential splits
- Enter "mae" to use the mean absolute error (MAE), which uses the median of each leaf node to minimize the L1 loss

In regression, what we pursue is an MSE that is as small as possible.

### Important attributes and interfaces

The most important attributes and interfaces are the same as those of the random forest classifier; apply, fit, predict and score remain the core. It is worth noting that random forest regression has no predict_proba interface: in regression there is no probability of a sample belonging to a certain class, so predict_proba does not exist.

```python
from sklearn.datasets import load_boston  # removed in scikit-learn >= 1.2; use fetch_california_housing there
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
import sklearn.metrics

boston = load_boston()
regressor = RandomForestRegressor(n_estimators=100, random_state=0)
cross_val_score(regressor, boston.data, boston.target, cv=10,
                scoring="neg_mean_squared_error")

# List all available scoring strings
# (in newer scikit-learn versions, use sklearn.metrics.get_scorer_names() instead of SCORERS)
sorted(sklearn.metrics.SCORERS.keys())
```
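The claim above that the regressor lacks predict_proba can be verified directly by inspecting the two estimator classes:

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# The classifier exposes predict_proba; the regressor does not
print(hasattr(RandomForestClassifier(), "predict_proba"))
print(hasattr(RandomForestRegressor(), "predict_proba"))
```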