xgboost bad debt forecast (two-class problem)

The reference code is from the Kaggle notebook "XGB starter in Python" (Two Sigma Connect: Rental Listing Inquiries): https://www.kaggle.com/code/sudalairajkumar/xgb-starter-in-python/notebook

This was my first contact with xgboost, starting my own exploration on the shoulders of predecessors. The code in this article is slightly adjusted from the reference code and is not necessarily correct, because I could not copy the full code over from that device; this is just a record of my own learning (a summary, not complete code).

1. Project Background

For student loans there are two outcomes: repayment on time (good debt) or default (bad debt). The goal is to judge a borrower in advance, i.e. predict whether the loan will turn into bad debt.

2. Data acquisition

There are two datasets, train and predict. The train dataset is the training set; predict is the test set and is the final dataset the predictions are scored on.

3. Exploratory Data Analysis

This reference document is useful:

Exploratory analysis of Two Sigma Connect: Rental Listing Inquiries — pydata

Compute statistics with value_counts() and data.describe(). Some features were found to have too many missing values (check missing values with df.isnull().sum()) or abnormal distributions.

The dataset contains both numeric features and categorical features.
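A minimal EDA sketch of these checks, assuming the data has already been read into a DataFrame train_df and that the target column is called label (both names are assumptions):

#Quick exploratory checks on the training data
print(train_df.shape)
print(train_df.dtypes.value_counts())                          #numeric vs object columns
print(train_df.describe())                                     #distribution of numeric features
print(train_df.isnull().sum().sort_values(ascending=False))    #missing values per column
print(train_df['label'].value_counts())                        #class balance of the target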

4. Data cleaning

1) The customer id carries no useful information and is deleted. In some scenarios the id column of a dataset does contain valid features, but in this scenario the id is just a serial number and is not useful.

2) Shuffle the data to eliminate any effect of the original row ordering.

3) Handle missing values by filling with fillna or sklearn.impute.SimpleImputer. Features whose missing rate exceeds 70% are deleted (see the sketch after this list).

4) Numeric features need little processing; use strip() and the like to remove spaces or stray characters before and after the values.

5) Categorical features can be converted to numeric features with one-hot encoding or LabelEncoder. For the xgboost classifier one-hot encoding is not really needed (more on that later), so I use LabelEncoder here.
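A rough sketch of steps 1)-3) above (drop the id, shuffle, drop features with too many missing values, fill the rest); the 70% threshold follows the text, while the column name customer_id and the median strategy are assumptions:

import numpy as np
from sklearn.impute import SimpleImputer

#1) drop the id column and shuffle the rows
train_df = train_df.drop(columns=['customer_id'])     #id column name assumed
train_df = train_df.sample(frac=1, random_state=0).reset_index(drop=True)

#2) drop features whose missing-value ratio exceeds 70%
missing_ratio = train_df.isnull().mean()
train_df = train_df.drop(columns=missing_ratio[missing_ratio > 0.7].index)

#3) fill remaining missing numeric values with the median
num_cols = train_df.select_dtypes(include=np.number).columns
imputer = SimpleImputer(strategy='median')
train_df[num_cols] = imputer.fit_transform(train_df[num_cols])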

5. Feature Engineering

#Read the data as DataFrames
import pandas as pd

train_df = pd.read_csv(train_file)
predict_df = pd.read_csv(predict_file)
print(train_df.shape)
print(predict_df.shape)

Keep the numeric features and convert the non-numeric ones: non-numeric variables that carry information are converted to corresponding numeric values. Delete sparse features: columns where most values are empty or identical. When the distribution is not symmetric, filling empty values with the median tends to preserve the ordering relationship better than the mean.

#features_to_use is the list of features kept after feature processing
features_to_use = ["feature_name", "feature_name", "feature_name", "feature_name"]

Build the feature and label data for splitting the dataset. In this dataset the customer number does not overlap between the training set and the test set, and the branch number carries no signal either; both are of little significance and are deleted.

xgb cannot handle categorical features directly, so LabelEncoder is used to encode them as numbers. Because xgb is not sensitive to missing values, the explicit missing-value handling is removed. Note that some features that look numeric are actually of string type because they carry characters before and after the value; treat those separately.

#Convert categorical features to numeric codes
from sklearn import preprocessing

categorical = ["feature1", "feature2", "feature3"]
for f in categorical:
    if train_df[f].dtype == 'object':
        #fit on train + predict so both sets share the same encoding
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(train_df[f].values) + list(predict_df[f].values))
        train_df[f] = lbl.transform(list(train_df[f].values))
        predict_df[f] = lbl.transform(list(predict_df[f].values))
        features_to_use.append(f)

Some categorical features are strings that embed a numeric code (such as 4_8-90,000 or 5_9-100,000); split off the number as a feature, add the new column name to the features_to_use list, and delete the rest.
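For example, the leading bucket number of such a string feature could be split off like this (the column names income_range and income_bucket are made up for illustration, and the values are assumed to contain no NaN):

#Extract the leading bucket number from strings like "4_8-90,000"
train_df['income_bucket'] = train_df['income_range'].str.split('_').str[0].astype(int)
predict_df['income_bucket'] = predict_df['income_range'].str.split('_').str[0].astype(int)
features_to_use.append('income_bucket')
#the original string column is no longer needed
train_df = train_df.drop(columns=['income_range'])
predict_df = predict_df.drop(columns=['income_range'])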

features_to_use = ["feature1","feature2","feature3","feature4","feature5",......]

(In fact, I only ran the full dataset once, after building the baseline model.)

Separate the features from the label in the train and predict datasets to get train_X (training-set features), train_y (training-set labels), test_X (test-set features) and test_y (test-set labels).
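A sketch of that split, assuming the label column is called label (the predict set may or may not carry a label):

#Separate features and label
train_X = train_df[features_to_use]
train_y = train_df['label']                #label column name assumed
test_X = predict_df[features_to_use]
test_y = predict_df['label'] if 'label' in predict_df.columns else None
print(train_X.shape, test_X.shape)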

6. Building xgboost classification model

Several elements: the param settings. The objective for this classification is set to 'multi:softprob' (outputs class probabilities) or 'multi:softmax' (outputs class labels directly). Generally the default threshold for turning a probability into a class label is 0.5, but this can of course be changed.

eta: learning rate, step size of each iteration

max_depth: the maximum depth of each tree; the larger it is, the more specifically the model learns (and the easier it overfits)

num_class: the number of classes. It is set to 3 here and still runs, but for this two-class problem 2 is enough

eval_metric: the evaluation metric; the evals/watchlist datasets are used only for evaluation, not for training, and just report the model's performance during training

The parameter differences between multiclass and binary classification are the loss function and the evaluation metric: for multiclass they become 'multi:softprob'/'multi:softmax' and 'mlogloss' (in lightgbm, 'multiclass' and 'multi_logloss'), and the number of classes must also be specified with 'num_class'.
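Roughly, the two settings differ only in objective, evaluation metric and num_class; a sketch of the native-interface parameter blocks (the other parameters stay the same):

#binary classification
param_binary = {
    'objective': 'binary:logistic',    #outputs P(y=1)
    'eval_metric': 'logloss',
}

#multiclass (also runs on a two-class problem)
param_multi = {
    'objective': 'multi:softprob',     #outputs one probability per class
    'eval_metric': 'mlogloss',
    'num_class': 2,                    #number of classes must be given explicitly
}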

xgboost has its own data format, DMatrix, which makes the native-interface style faster. If you want to see evaluation metrics during training, use a watchlist.

There is also sklearn interface style:

XGBoost classifier for machine learning: XGBClassifier (xgb with the sklearn interface) - Mulliessen's Blog - CSDN Blog

model=XGBClassifier(**params)

The sklearn interface style uses the fit() method, and eval_set=[(X, y)] can be passed to print evaluation values during training.
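A minimal sketch of the sklearn interface style; the values mirror the native-interface parameters used below (eta becomes learning_rate, num_rounds becomes n_estimators), and binary:logistic is used here since the task is two-class:

from xgboost import XGBClassifier

clf = XGBClassifier(learning_rate=0.1, max_depth=6, n_estimators=400,
                    subsample=0.7, colsample_bytree=0.7,
                    objective='binary:logistic', eval_metric='logloss')
#eval_set prints the evaluation metric on the given data after every boosting round
clf.fit(train_X, train_y, eval_set=[(train_X, train_y)], verbose=True)
pred_proba = clf.predict_proba(test_X)[:, 1]    #probability of the positive class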

runXGB function. Define parameters, training.

#Define the xgboost run function; inputs: training set and test set
#Parameters: evaluation metric mlogloss, objective multi:softprob
import xgboost as xgb

def runXGB(train_X, train_y, test_X, test_y=None, feature_names=None, seed_val=0, num_rounds=1000):
    param = {}
    #The difference between multi:softprob and multi:softmax is that
    #The former outputs the probability for each category, and the latter outputs the classification label results, which are the same when the probability threshold is 50%.
    param['objective'] = 'multi:softprob'
    param['eta'] = 0.1
    param['max_depth'] = 6
    param['verbosity'] = 0          #'silent' was replaced by 'verbosity' in newer xgboost
    param['num_class'] = 3
    param['eval_metric'] = "mlogloss"
    param['min_child_weight'] = 1
    param['subsample'] = 0.7
    param['colsample_bytree'] = 0.7
    param['seed'] = seed_val
    num_rounds = num_rounds

    plst = list(param.items())
    xgtrain = xgb.DMatrix(train_X, label=train_y)

    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [ (xgtrain,'train'), (xgtest, 'test') ]
        model = xgb.train(plst, xgtrain, num_rounds, watchlist, early_stopping_rounds=20)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)

    pred_test_y = model.predict(xgtest)
    return pred_test_y, model

k-fold cross validation: take 5-fold indices over the training set and build a smaller training set and validation set for each fold.

#k-fold cross-validation: 5-fold training for the imbalanced sample set
from sklearn import model_selection
from sklearn.metrics import log_loss

cv_scores = []
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in kf.split(range(train_X.shape[0])):
    dev_X, val_X = train_X.iloc[dev_index], train_X.iloc[val_index]
    dev_y, val_y = train_y.iloc[dev_index], train_y.iloc[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)
    cv_scores.append(log_loss(val_y, preds))
print(cv_scores)

Predict, save results

preds, model = runXGB(train_X, train_y, test_X, num_rounds=400)
out_df = pd.DataFrame(preds)
#with num_class=3 preds has three columns; set num_class=2 (or use binary:logistic) to get exactly two
out_df.columns = ["class1", "class2"]
out_df["listing_id"] = predict_df.listing_id.values
out_df.to_csv("xgb_starter2.csv", index=False)

7. Model evaluation

xgboost multi-classification: comparative analysis of the objective parameter (reg:linear, multi:softmax, multi:softprob) - phyllisyuell's Blog - CSDN Blog

Because the sample set is imbalanced (the ratio of label Y to label N is about 1:10), even a high-looking AUC can hide the fact that up to 90% of the Y-label samples are misclassified, so the confusion matrix and its related indicators (TN/TP/FN/FP/recall/...) are used as the evaluation method, and feature_importances_ is printed to inspect feature importance. In this business we want as few misclassifications as possible, and in particular as few bad borrowers predicted as good as possible.
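A sketch of this evaluation on the last validation fold from the CV loop above, assuming the bad-debt class is label 1 and sits in column 1 of the multi:softprob output, with a 0.5 threshold:

from sklearn.metrics import confusion_matrix, classification_report

#turn predicted probabilities into labels with a 0.5 threshold
pred_label = (preds[:, 1] > 0.5).astype(int)
print(confusion_matrix(val_y, pred_label))         #rows: true class, columns: predicted class
print(classification_report(val_y, pred_label))    #precision / recall / f1 per class

#feature importance from the native booster (gain-based)
importance = model.get_score(importance_type='gain')
print(sorted(importance.items(), key=lambda x: x[1], reverse=True)[:10])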

8. Model optimization

GridSearchCV.grid_scores_ and mean_validation_score errors - allein_STR's Blog - CSDN Blog

Handling imbalanced datasets:

Data resampling: undersampling, oversampling, scale_pos_weight (which raises the weight of the minority class, roughly equivalent to oversampling).

StratifiedKFold cross validation: the training set is split into k-1 folds for training and 1 fold for validation while preserving the class ratio in each fold, and the model is trained on each split (a sketch follows below).
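A sketch of both ideas, assuming label 1 is the rare bad-debt class and train_X/train_y are the DataFrame/Series built earlier:

from sklearn.model_selection import StratifiedKFold

#scale_pos_weight ~ (number of negatives) / (number of positives), here roughly 10
scale_pos_weight = float((train_y == 0).sum()) / (train_y == 1).sum()
#this value would go into the params as 'scale_pos_weight' together with a binary:logistic objective

#StratifiedKFold keeps the 1:10 class ratio in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in skf.split(train_X, train_y):
    dev_X, val_X = train_X.iloc[dev_index], train_X.iloc[val_index]
    dev_y, val_y = train_y.iloc[dev_index], train_y.iloc[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)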

Cross-validation in xgb.cv vs sklearn: Compare xgboost.cv and sklearn cross-validation - schdut's Blog - CSDN Blog: https://blog.csdn.net/shengchaohua163/article/details/105100151 (In short: XGBoost has two interfaces, the native one, e.g. xgboost.train and xgboost.cv, and the sklearn one, e.g. xgboost.XGBClassifier and xgboost.XGBRegressor; the parameter names differ slightly, e.g. eta in the native interface vs learning_rate in the sklearn interface.)

cv_scores = []
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in kf.split(range(train_X.shape[0])):
    #array-style indexing version; the break runs only the first fold as a quick check
    dev_X, val_X = train_X[dev_index,:], train_X[val_index,:]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)
    cv_scores.append(log_loss(val_y, preds))
    print(cv_scores)
    break

Parameter tuning: grid search / random search / Bayesian optimization.

General methods for tuning parameters:

Complete guide and practice for xgboost parameter tuning - Bubbling Lulu's Blog - CSDN Blog

Xgboost modeling and sklearn evaluation: confusion matrix for classification problems, MSE for regression problems - Yuanyuan Park's Blog - CSDN Blog

1. Choose a relatively high learning rate and tune the number of estimators for it. Generally the learning rate is set to 0.1, but depending on the problem it can range from 0.05 to 0.3.
2. With the learning rate fixed, tune the tree-specific parameters, e.g. max_depth, min_child_weight, gamma, subsample, colsample_bytree.
3. Tune the regularization parameters, such as lambda and alpha, mainly to reduce model complexity, speed up training and reduce overfitting.
4. Lower the learning rate and choose the best parameters.

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

param_test1 = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 2))
}
#iid was removed from GridSearchCV in newer sklearn versions, so it is dropped here
gsearch1 = GridSearchCV(estimator=XGBClassifier(learning_rate=0.1, n_estimators=20, max_depth=5,
                                                min_child_weight=1, gamma=0, subsample=0.8, colsample_bytree=0.8,
                                                objective='binary:logistic', nthread=4, scale_pos_weight=1, seed=27),
                        param_grid=param_test1, scoring='roc_auc', n_jobs=4, cv=5)
gsearch1.fit(train[predictors], train[target])
scores = gsearch1.cv_results_['mean_test_score']
best_param = gsearch1.best_params_
best_score = gsearch1.best_score_

Output tuning results and optimal parameters, refer to this:

GridSearchCV.grid_scores_ and mean_validation_score errors - allein_STR's Blog - CSDN Blog
As a result, F1 increased by about 0.01, recall went up, and accuracy went down.

Tuning is only icing on the cake; the result still depends mainly on things like feature engineering.
 

9. Other:

I also tried lightGBM.

It is much like xgb; the differences are in some parameter names and in the data format (lgb uses its own Dataset format).
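A minimal sketch of the lightgbm native interface for comparison, reusing one dev/val split from the earlier fold loop (parameter names are lightgbm's own; the values are illustrative):

import lightgbm as lgb

lgb_params = {
    'objective': 'binary',         #'multiclass' plus 'num_class' for the multiclass case
    'metric': 'binary_logloss',
    'learning_rate': 0.1,
    'num_leaves': 31,
}
#lgb uses its own Dataset format instead of xgb's DMatrix
lgb_train = lgb.Dataset(dev_X, label=dev_y)
lgb_val = lgb.Dataset(val_X, label=val_y, reference=lgb_train)
gbm = lgb.train(lgb_params, lgb_train, num_boost_round=1000, valid_sets=[lgb_val])
lgb_preds = gbm.predict(val_X)     #probability of the positive class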

Introduction to lightgbm parameters - Blog on Line 1 Along the Way - CSDN Blog

Feature Engineering & LightGBM | Kaggle

[lightgbm, xgboost, nn code collation 1] lightgbm for binary classification, multiclass and regression tasks (with Python source) - QLMX's Blog - CSDN Blog

Machine Learning: lightgbm (in practice: classification & regression) - Juejin

lightgbm Classification - MiQing4in - Blog Park

lgb Summary - Zhihu

By the way, here is a reproduction of a Kaggle xgb solution that is well written and worth reading: Kaggle payment anti-fraud: IEEE-CIS Fraud Detection first-place solution reproduction (with code) - Zhihu

Pitfalls encountered:

The data was not shuffled, so the original row ordering had too much influence.

Forgetting to write values back to the DataFrame after filling or replacing features.

With an imbalanced label, the evaluation cannot rely on AUC alone.

imputer fills in missing values

LabelEncoder has no built-in way of handling NaN values (a workaround sketch is at the end of this list).

lgb data format is different from xgb

Different interface styles
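For the LabelEncoder/NaN pitfall above, a simple workaround is to fill the categorical NaNs with a placeholder string before encoding (a sketch; the placeholder value is an assumption):

#LabelEncoder cannot handle NaN mixed with strings, so fill a placeholder first
for f in categorical:
    train_df[f] = train_df[f].fillna('missing')
    predict_df[f] = predict_df[f].fillna('missing')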

