Implementation of random forest in sklearn

Random forest


Overview of ensemble algorithms

Ensemble learning is a widely used machine learning technique. It is not a single algorithm: instead, it builds multiple models on the data and combines the results of all of them.

Goal of ensemble algorithms
An ensemble algorithm considers the results of multiple models and aggregates them into one comprehensive result, so as to achieve better classification or regression performance than any single model.

Generally speaking, there are three families of ensemble algorithms: bagging, boosting and stacking.

The bagging method builds multiple independent estimators and then averages their predictions (or takes a majority vote) to produce the ensemble result. The representative bagging model is the random forest.

In the boosting method, the base estimators are related and built one by one, in order. The core idea is to combine the power of weak estimators by repeatedly re-fitting the samples that are hard to predict, so as to form one strong estimator. Representative boosting models include AdaBoost and gradient boosting trees.
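As an illustrative sketch (not from the original text), the two families can be compared directly on the wine dataset used later in this article. `BaggingClassifier` aggregates independent trees; `AdaBoostClassifier` builds its trees sequentially:

```python
# Illustrative comparison of bagging vs boosting (assumed setup: wine dataset).
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# Bagging: independent base estimators, combined by majority vote
bag = BaggingClassifier(n_estimators=25, random_state=0)
# Boosting: base estimators built sequentially, each focusing on hard samples
boost = AdaBoostClassifier(n_estimators=25, random_state=0)

print("Bagging :", cross_val_score(bag, X, y, cv=5).mean())
print("Boosting:", cross_val_score(boost, X, y, cv=5).mean())
```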

Ensemble algorithms in sklearn

| Class | Function |
| --- | --- |
| ensemble.AdaBoostClassifier | AdaBoost classification |
| ensemble.AdaBoostRegressor | AdaBoost regression |
| ensemble.BaggingClassifier | Bagging classification |
| ensemble.BaggingRegressor | Bagging regression |
| ensemble.ExtraTreesClassifier | Extra-trees classification (extremely randomized trees) |
| ensemble.ExtraTreesRegressor | Extra-trees regression |
| ensemble.GradientBoostingClassifier | Gradient boosting classification |
| ensemble.GradientBoostingRegressor | Gradient boosting regression |
| ensemble.IsolationForest | Isolation forest |
| ensemble.RandomForestClassifier | Random forest classification |
| ensemble.RandomForestRegressor | Random forest regression |
| ensemble.RandomTreesEmbedding | Ensemble of completely random trees |
| ensemble.VotingClassifier | Soft voting / majority rule classifier for unfitted estimators |


| Parameter | Meaning |
| --- | --- |
| criterion | Impurity measure; two options: Gini impurity and information entropy |
| max_depth | Maximum depth of the tree; branches beyond the maximum depth are pruned |
| min_samples_leaf | Each child node of a split must contain at least min_samples_leaf training samples, otherwise the split will not happen |
| min_samples_split | A node must contain at least min_samples_split training samples to be allowed to split, otherwise the split will not happen |
| max_features | Limits the number of features considered when splitting; features beyond the limit are discarded. The default is the square root of the total number of features |
| min_impurity_decrease | Minimum information gain; splits whose gain is below the set value will not be made |
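A hypothetical sketch showing how these pruning parameters are passed to `RandomForestClassifier` (the specific values are arbitrary, chosen only for illustration):

```python
# Illustrative use of the pruning parameters above (assumed setup: wine dataset).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)

rfc = RandomForestClassifier(
    n_estimators=50,
    criterion="gini",           # impurity measure: "gini" or "entropy"
    max_depth=4,                # branches deeper than 4 are cut off
    min_samples_leaf=2,         # every leaf must hold at least 2 samples
    min_samples_split=5,        # a node needs 5 samples before it may split
    max_features="sqrt",        # consider sqrt(n_features) features per split
    min_impurity_decrease=0.0,  # a split must reduce impurity by at least this
    random_state=0,
)
rfc.fit(X, y)
print(rfc.score(X, y))
```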


n_estimators is the number of trees in the forest, i.e., the number of base estimators. Its effect on accuracy is roughly monotonic: a larger n_estimators generally gives a better model, until performance levels off (at the cost of more computation and memory).
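A minimal sketch of this effect, assuming the wine dataset used below: cross-validated accuracy typically climbs as n_estimators grows and then levels off.

```python
# Sketch of a learning curve over n_estimators (assumed setup: wine dataset).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

scores = []
for n in range(10, 101, 30):  # n_estimators = 10, 40, 70, 100
    rfc = RandomForestClassifier(n_estimators=n, random_state=0)
    scores.append(cross_val_score(rfc, X, y, cv=5).mean())
print(scores)  # accuracy rises with n_estimators, then plateaus
```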

  • Build a forest
    Import package
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine

Basic modeling process

  1. Instantiate the estimator
  2. Fit the model on the training data (and tune parameters)
  3. Call the interfaces (score, predict, ...) to obtain results
from sklearn.model_selection import train_test_split

wine = load_wine()

#Split into training set and test set
Xtrain, Xtest, Ytrain, Ytest = train_test_split(wine.data, wine.target, test_size=0.3)

clf = DecisionTreeClassifier(random_state=0)
rfc = RandomForestClassifier(random_state=0)

#Train the models
clf = clf.fit(Xtrain, Ytrain)
rfc = rfc.fit(Xtrain, Ytrain)

#View model accuracy on the test set
score_c = clf.score(Xtest, Ytest)
score_r = rfc.score(Xtest, Ytest)

print("Single Tree:{}".format(score_c)
     ,"Random Forest:{}".format(score_r))

Compare random forest and decision tree over ten rounds of ten-fold cross validation:

from sklearn.model_selection import cross_val_score
import matplotlib.pyplot as plt

rfc_l = []
clf_l = []

for i in range(10):
    rfc = RandomForestClassifier(n_estimators=25)
    rfc_s = cross_val_score(rfc, wine.data, wine.target, cv=10).mean()
    rfc_l.append(rfc_s)
    clf = DecisionTreeClassifier()
    clf_s = cross_val_score(clf, wine.data, wine.target, cv=10).mean()
    clf_l.append(clf_s)

plt.plot(range(1,11), rfc_l, label="Random Forest")
plt.plot(range(1,11), clf_l, label="Decision Tree")
plt.legend()
plt.show()


Random forest also has random_state, and its usage is similar to that in the classification tree, except that in a classification tree random_state controls the generation of a single tree, while in a random forest it controls the way the whole forest is generated, rather than the forest containing only one tree.

rfc = RandomForestClassifier(n_estimators=20, random_state=2)
rfc = rfc.fit(Xtrain, Ytrain)
#One of the important attributes of random forest: estimators_, which exposes the trees in the forest
for i in range(len(rfc.estimators_)):
    print(rfc.estimators_[i].random_state)

bootstrap & oob_score

To make the base classifiers as different as possible, an easy-to-understand approach is to train them on different training sets. The bagging method creates different training data through random sampling with replacement, and the bootstrap parameter controls this sampling technique.

From an original training set containing n samples, we draw one sample at random at a time, and put it back into the original training set before drawing the next one, so the same sample may be drawn again. Repeating this n times yields a bootstrap set of n samples, the same size as the original training set. Because of the random sampling, each bootstrap set differs from the original data set and from the other bootstrap sets. In this way we can freely create an endless supply of different bootstrap sets; training the base classifiers on them naturally makes the classifiers different.
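These bootstrap sets come with a useful side effect: each one leaves some samples out. A quick numpy sketch (illustrative, not from the original text) shows that a bootstrap set of size n contains only about 63.2% of the distinct original samples; the rest are the "out of bag" samples that oob_score evaluates on.

```python
# Sketch: sampling with replacement leaves out roughly 1/e (~36.8%) of samples.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
indices = rng.integers(0, n, size=n)  # draw n samples with replacement
frac = np.unique(indices).size / n    # fraction of distinct samples drawn
print(frac)                           # close to 1 - 1/e ~ 0.632
```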

#There is no need to divide the training set and the test set
rfc = RandomForestClassifier(n_estimators=25, oob_score=True)
rfc = rfc.fit(wine.data, wine.target)
#Important attribute oob_score_: the accuracy on the out-of-bag samples
print(rfc.oob_score_)


Four important parameters:
n_estimators, random_state, bootstrap and oob_score: these four parameters help us understand the basic process and key concepts of the bagging method.
Two important attributes:
estimators_ and oob_score_
Besides these two attributes, as an ensemble of tree models, a random forest naturally also has the feature_importances_ attribute.
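A minimal sketch of feature_importances_ on the wine dataset (assumed setup): the importances sum to 1, and pairing them with feature names makes them readable.

```python
# Sketch: inspecting feature_importances_ (assumed setup: wine dataset).
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

wine = load_wine()
rfc = RandomForestClassifier(n_estimators=25, random_state=0)
rfc = rfc.fit(wine.data, wine.target)

# importances sum to 1; show the five most important features
for name, imp in sorted(zip(wine.feature_names, rfc.feature_importances_),
                        key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.3f}")
```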


Random forest regressor

All parameters, attributes and interfaces are consistent with those of the random forest classifier. The only difference is that a regression tree differs from a classification tree, so the impurity metric, and hence the criterion parameter, is different.


For a regression tree, criterion is the measure of split quality, and it supports three options:

  1. Enter "mse" to use the mean squared error (MSE). The difference in MSE between the parent node and its leaf nodes is used as the criterion for feature selection; this method minimizes the L2 loss by using the mean of the leaf node.
  2. Enter "friedman_mse" to use Friedman's improved mean squared error, which addresses problems in potential splits.
  3. Enter "mae" to use the mean absolute error (MAE), which uses the median of the leaf node to minimize the L1 loss.

In regression, the smaller the MSE, the better. Note that sklearn's cross validation uses the scoring string "neg_mean_squared_error", so the reported scores are negative MSE values.

Important attributes and interfaces

The most important attributes and interfaces are consistent with those of the random forest classifier; apply, fit, predict and score remain the core. It is worth mentioning that random forest regression has no predict_proba interface: in regression there is no probability of a sample belonging to a class, so predict_proba does not exist.

from sklearn.datasets import load_boston  #note: load_boston was removed in scikit-learn 1.2
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()

regressor = RandomForestRegressor(n_estimators=100, random_state=0)
cross_val_score(regressor, boston.data, boston.target, cv=10
               ,scoring = "neg_mean_squared_error")
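A sketch on synthetic regression data (assumed setup, using make_regression instead of the Boston dataset) confirming the shared interfaces and the missing predict_proba:

```python
# Sketch: the regressor shares apply/fit/predict/score with the classifier,
# but has no predict_proba (assumed synthetic data).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
reg = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

print(reg.predict(X[:3]))             # point predictions
print(reg.apply(X[:3]).shape)         # leaf index of each sample in each tree
print(reg.score(X, y))                # R^2, not classification accuracy
print(hasattr(reg, "predict_proba"))  # False: no class probabilities in regression
```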

Tags: Machine Learning sklearn

Posted by Chinese on Wed, 15 Dec 2021 17:03:09 +1030