The reference code is from the Kaggle notebook "XGB Starter in Python" (Two Sigma Connect: Rental Listing Inquiries): https://www.kaggle.com/code/sudalairajkumar/xgb-starter-in-python/notebook
First contact with xgboost; start your own exploration on the shoulders of predecessors. The code in this article is slightly adjusted from the reference code and is not necessarily correct, because I could not copy the full code over from that machine; this is just a record of my own learning (a summary, not complete code).
1. Project Background
In a student-loan scenario there are two outcomes: repayment on time (good debt) or default (bad debt). The task is to judge each borrower and predict whether a bad debt will occur.
2. Data acquisition
There are already two datasets, train and predict. train is the training set; predict is the test set, the final dataset on which predictions are scored.
3. Exploratory Data Analysis
This reference document is useful:
Exploratory analysis of Two Sigma Connect: Rental Listing Inquiries — pydata
Use value_counts() and data.describe() for summary statistics. Check missing values with df.isnull().sum(); some features have too many missing values, and some distributions are anomalous.
The dataset contains both numeric features and categorical features.
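A minimal sketch of these checks (file and column names are hypothetical, not from the original code):

# Quick EDA sketch (hypothetical file/column names)
import pandas as pd

train_df = pd.read_csv("train.csv")
print(train_df.describe())                                    # summary statistics for numeric columns
print(train_df["some_category"].value_counts())               # distribution of a categorical column
print(train_df.isnull().sum().sort_values(ascending=False))   # missing values per column
print(train_df.dtypes)                                        # numeric vs. object (categorical) columns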
4. Data cleaning
1) The customer id contains no useful information and is deleted. In some scenarios an id column does carry useful features, but here the id is just a serial number.
2) Shuffle the data to eliminate effects from the original ordering of consecutive samples.
3) Handle missing data with fillna or sklearn.impute.SimpleImputer. Features whose missing ratio exceeds 70% are deleted instead of filled (see the sketch after this list).
4) Numeric features need little other processing; use strip and similar methods to remove spaces or stray characters before and after values.
5) Categorical features can be converted to numeric features with one-hot encoding or LabelEncoder. One-hot encoding is not a good fit for xgboost classification (more on this later), so I use LabelEncoder here.
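A minimal sketch of step 3), assuming a pandas DataFrame train_df with hypothetical column names:

# Drop features whose missing ratio exceeds 70%, then fill the rest
import pandas as pd
from sklearn.impute import SimpleImputer

missing_ratio = train_df.isnull().mean()
train_df = train_df.drop(columns=missing_ratio[missing_ratio > 0.7].index)

# Median fill for one numeric column, or SimpleImputer for several at once
train_df["numeric_feature"] = train_df["numeric_feature"].fillna(train_df["numeric_feature"].median())
imputer = SimpleImputer(strategy="median")
train_df[["feat_a", "feat_b"]] = imputer.fit_transform(train_df[["feat_a", "feat_b"]])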
5. Feature Engineering
# Read data in DataFrame format
import pandas as pd

train_df = pd.read_csv(train_file)
predict_df = pd.read_csv(predict_file)
print(train_df.shape)
print(predict_df.shape)
Keep the numeric features and convert the non-numeric ones: non-numeric variables that carry extra information are mapped to corresponding values. Delete sparse features: columns where most values are empty or identical. When the distribution is not symmetric, filling missing values with the median tends to preserve ordering relationships better than the mean.
# features_to_use holds the features kept after feature processing
features_to_use = ["feature_name", "feature_name", "feature_name", "feature_name"]
Build the feature and label data for splitting the datasets. In this dataset the customer number does not overlap between the training set and the test set, and the branch number is likewise uninformative, so both are deleted.
xgboost cannot handle categorical features directly, so LabelEncoder encodes them as numeric values. Because xgboost is insensitive to missing values, the earlier missing-value handling is dropped here. Note that some features look numeric but have string dtype because of characters before or after the number; treat those separately.
# Convert categorical features to numeric with LabelEncoder
from sklearn import preprocessing

categorical = ["feature1", "feature2", "feature3"]
for f in categorical:
    if train_df[f].dtype == 'object':
        lbl = preprocessing.LabelEncoder()
        # fit on train + predict together so both share the same encoding
        lbl.fit(list(train_df[f].values) + list(predict_df[f].values))
        train_df[f] = lbl.transform(list(train_df[f].values))
        predict_df[f] = lbl.transform(list(predict_df[f].values))
        features_to_use.append(f)
Some categorical features are strings that embed a numeric grade (such as 4_8-90,000 or 5_9-100,000). Split out the number as a feature, add its name to the features_to_use list, and delete the rest (see the sketch below).
features_to_use = ["feature1","feature2","feature3","feature4","feature5",......]
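A sketch of that split step, assuming a hypothetical column grade_band with values like "4_8-90,000"; the leading number before the underscore is kept as a numeric feature:

# Extract the leading grade number from strings like "4_8-90,000" (hypothetical column name)
train_df["grade_num"] = train_df["grade_band"].str.split("_").str[0].astype(float)
predict_df["grade_num"] = predict_df["grade_band"].str.split("_").str[0].astype(float)
features_to_use.append("grade_num")
train_df = train_df.drop(columns=["grade_band"])     # delete the rest
predict_df = predict_df.drop(columns=["grade_band"])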
(In fact, I ran the full data once after I built the baseline model)
Separate features and labels for the train and predict datasets to obtain train_X (training-set features), train_y (training-set labels), test_X (test-set features), and test_y (test-set labels).
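In code this is roughly the following, assuming the label column is named "label" (a hypothetical name, not from the original):

# Split features and label (the label column name is an assumption)
train_X = train_df[features_to_use]
train_y = train_df["label"]
test_X = predict_df[features_to_use]
test_y = predict_df["label"]     # only if the predict set actually has labels
print(train_X.shape, test_X.shape)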
6. Building xgboost classification model
Several elements: the param settings; for classification the objective function is set to 'multi:softprob' (outputs class probabilities) or 'multi:softmax' (outputs class labels). By default the probability threshold for the output classification is 0.5, but this can of course be changed.
eta: learning rate, step size of each iteration
max_depth: the maximum depth of each tree; the larger it is, the more specific (and more overfitting-prone) the model's learning
num_class: the number of classes. The reference code sets it to 3 (the Kaggle competition has three classes); for this binary problem it should be 2, or a binary objective can be used instead
eval_metric: the evaluation metric. The datasets passed in evals are only evaluated, not trained on; their metric values are printed so you can monitor the model
The parameter difference between multiclass and binary classification lies in the objective and the evaluation metric; for multiclass these become 'multi:softprob'/'multi:softmax' and 'mlogloss' (in LightGBM, 'multiclass' and 'multi_logloss'), and the number of classes must also be specified with 'num_class'.
xgboost has its own data format, DMatrix, which is faster when used with the native interface. To see evaluation metrics during training, pass a watchlist.
There is also sklearn interface style:
model=XGBClassifier(**params)
The sklearn interface style uses the fit() method, and eval_set=[(X, y)] can be passed to print evaluation values during training.
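A minimal sketch of the sklearn-style interface; parameter values simply mirror the native-interface example below and are not tuned for this dataset:

# sklearn interface style: fit() with eval_set prints metrics during training
from xgboost import XGBClassifier

model = XGBClassifier(learning_rate=0.1, max_depth=6, subsample=0.7,
                      colsample_bytree=0.7, n_estimators=1000,
                      eval_metric="mlogloss")   # constructor argument in xgboost >= 1.6
model.fit(train_X, train_y,
          eval_set=[(train_X, train_y), (test_X, test_y)])   # second entry is the held-out set
pred = model.predict_proba(test_X)              # class probabilities, like multi:softprob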
runXGB function. Define parameters, training.
# xgboost run function: takes a training set and a test set
# evaluation metric is mlogloss, objective is multi:softprob
import xgboost as xgb

def runXGB(train_X, train_y, test_X, test_y=None, feature_names=None,
           seed_val=0, num_rounds=1000):
    param = {}
    # multi:softprob outputs a probability for each class;
    # multi:softmax outputs the most probable class label directly.
    param['objective'] = 'multi:softprob'
    param['eta'] = 0.1
    param['max_depth'] = 6
    param['verbosity'] = 0          # replaces the deprecated 'silent' parameter
    param['num_class'] = 3
    param['eval_metric'] = "mlogloss"
    param['min_child_weight'] = 1
    param['subsample'] = 0.7
    param['colsample_bytree'] = 0.7
    param['seed'] = seed_val
    plst = list(param.items())

    xgtrain = xgb.DMatrix(train_X, label=train_y)
    if test_y is not None:
        xgtest = xgb.DMatrix(test_X, label=test_y)
        watchlist = [(xgtrain, 'train'), (xgtest, 'test')]
        model = xgb.train(plst, xgtrain, num_rounds, watchlist,
                          early_stopping_rounds=20)
    else:
        xgtest = xgb.DMatrix(test_X)
        model = xgb.train(plst, xgtrain, num_rounds)

    pred_test_y = model.predict(xgtest)
    return pred_test_y, model
k-fold cross validation: generate five-fold indices over the training set and build a small training subset and a validation subset for each fold.
# k-fold cross-validation: 5-fold training for the unbalanced sample set
from sklearn import model_selection
from sklearn.metrics import log_loss

cv_scores = []
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in kf.split(range(train_X.shape[0])):
    dev_X, val_X = train_X.iloc[dev_index], train_X.iloc[val_index]
    dev_y, val_y = train_y.iloc[dev_index], train_y.iloc[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)
    cv_scores.append(log_loss(val_y, preds))
    print(cv_scores)
Predict, save results
# Train on the full training set, predict on the test set, and save the result
preds, model = runXGB(train_X, train_y, test_X, num_rounds=400)
out_df = pd.DataFrame(preds)
out_df.columns = ["class1", "class2"]   # one column per class; the count must match num_class
out_df["listing_id"] = test_df.listing_id.values
out_df.to_csv("xgb_starter2.csv", index=False)
7. Model evaluation
Because the dataset is imbalanced (the ratio of label Y to label N is about 1:10), a single accuracy/AUC number can look fine even when up to 90% of the Y-label cases are classified incorrectly, so the confusion matrix and its related indicators (TN/TP/FN/FP/recall/...) are used for evaluation, and feature_importances_ is output to inspect feature importance. In practice we want as few misclassifications as possible, and in particular as few bad borrowers as possible being predicted as good.
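A minimal evaluation sketch along those lines, using the preds/val_y from the cross-validation loop above; taking the argmax class is an assumption, and in this imbalanced setting the decision threshold itself is worth tuning:

# Confusion matrix and related metrics instead of a single accuracy/AUC number
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

val_pred_label = np.argmax(preds, axis=1)             # preds: per-class probabilities from runXGB
print(confusion_matrix(val_y, val_pred_label))        # rows: true labels, columns: predicted labels
print(classification_report(val_y, val_pred_label))   # precision / recall / F1 per class

# Feature importance: get_score() on the native Booster
# (feature_importances_ is the equivalent attribute on the sklearn wrapper)
print(model.get_score(importance_type="gain"))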
8. Model optimization
Processing unbalanced datasets.
Data resampling: undersampling, oversampling, or scale_pos_weight (which increases the weight of the minority class, roughly equivalent to oversampling).
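A sketch of the scale_pos_weight route, assuming a binary setup where label 1 is the rare bad-debt class; the xgboost documentation suggests starting from the negative/positive count ratio:

# Weight the rare positive class: roughly sum(negative) / sum(positive)
import numpy as np
from xgboost import XGBClassifier

ratio = float(np.sum(train_y == 0)) / np.sum(train_y == 1)
model = XGBClassifier(objective="binary:logistic", learning_rate=0.1,
                      n_estimators=200, scale_pos_weight=ratio)
model.fit(train_X, train_y)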
StratifiedKFold / k-fold cross validation: the training set is split into k-1 folds for training and 1 fold for validation, and a model is trained on each split.
cv_scores = []
kf = model_selection.KFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in kf.split(range(train_X.shape[0])):
    dev_X, val_X = train_X[dev_index, :], train_X[val_index, :]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)
    cv_scores.append(log_loss(val_y, preds))
    print(cv_scores)
    break   # only the first fold is run here
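The snippet above uses a plain KFold; a stratified split, which keeps the Y:N ratio the same in every fold and is better suited to imbalanced labels, would look roughly like this (a sketch, not the original code):

# StratifiedKFold keeps the class ratio constant across folds
from sklearn import model_selection

skf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=2016)
for dev_index, val_index in skf.split(train_X, train_y):   # split needs the labels too
    dev_X, val_X = train_X[dev_index, :], train_X[val_index, :]
    dev_y, val_y = train_y[dev_index], train_y[val_index]
    preds, model = runXGB(dev_X, dev_y, val_X, val_y)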
Parameter tuning: grid search / random search / Bayesian optimization.
General methods for tuning parameters:
1. Choose a relatively high learning rate and determine a suitable number of trees for it. Generally the learning rate is set to 0.1, but depending on the problem it can range from 0.05 to 0.3.
2. With the learning rate fixed, tune the tree-specific parameters, e.g. max_depth, min_child_weight, gamma, subsample, colsample_bytree.
3. Tune the regularization parameters, such as lambda and alpha. This mainly reduces model complexity, speeds up training, and helps curb overfitting.
4. Reduce the learning rate and choose the best parameters
# Grid search over max_depth and min_child_weight
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_test1 = {
    'max_depth': list(range(3, 10, 2)),
    'min_child_weight': list(range(1, 6, 2)),
}
gsearch1 = GridSearchCV(
    estimator=XGBClassifier(learning_rate=0.1, n_estimators=20, max_depth=5,
                            min_child_weight=1, gamma=0, subsample=0.8,
                            colsample_bytree=0.8, objective='binary:logistic',
                            nthread=4, scale_pos_weight=1, seed=27),
    param_grid=param_test1,
    scoring='roc_auc',
    n_jobs=4,
    cv=5)   # iid=False was dropped; that parameter no longer exists in recent sklearn
gsearch1.fit(train[predictors], train[target])
scores = gsearch1.cv_results_['mean_test_score']
best_param = gsearch1.best_params_
best_score = gsearch1.best_score_
Output tuning results and optimal parameters, refer to this:
GridSearchCV.grid_scores_ and mean_validation_score error — allein_STR's Blog, CSDN Blog
As a result, there was a 0.01 increase in F1, an increase in recall rates, and a decrease in accuracy.
Tuning is only icing on the cake; the result still depends mainly on factors such as feature engineering.
9. Other:
I also tried lightGBM.
It is much like xgboost, with differences in some parameters and in the data format (LightGBM uses its own Dataset format).
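A minimal native-interface sketch of the LightGBM equivalent; parameter names are from the LightGBM docs, the values simply mirror the xgboost example above and are not tuned, and dev_X/val_X come from the cross-validation split earlier:

# LightGBM native interface: Dataset format instead of DMatrix
import lightgbm as lgb

params = {
    'objective': 'multiclass',      # 'binary' for two classes
    'num_class': 3,
    'metric': 'multi_logloss',
    'learning_rate': 0.1,
    'max_depth': 6,
    'feature_fraction': 0.7,        # roughly colsample_bytree in xgboost
    'bagging_fraction': 0.7,        # roughly subsample in xgboost
    'bagging_freq': 1,
}
lgtrain = lgb.Dataset(dev_X, label=dev_y)
lgval = lgb.Dataset(val_X, label=val_y, reference=lgtrain)
model = lgb.train(params, lgtrain, num_boost_round=1000, valid_sets=[lgval],
                  callbacks=[lgb.early_stopping(20)])   # callback style, lightgbm >= 3.3
preds = model.predict(val_X, num_iteration=model.best_iteration)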
Introduction to lightgbm parameters_ Blog on Line 1 along the way - CSDN Blog_ lightgbm parameter
Feature Engineering & LightGBM | Kaggle
[lightgbm, xgboost, nn code collation 1] lightgbm for binary classification, multi-class classification, and regression tasks (with Python source) — QLMX's Blog, CSDN Blog
Machine Learning: lightgbm (in practice: classification & regression) — Juejin
lightgbm Classification - MiQing4in - Blog Park
By the way, here is a reproduction of Kaggle xgb code that is well written and worth reading: Kaggle payment anti-fraud: IEEE-CIS Fraud Detection first-place solution reproduction process (with code) — Zhihu
Pitfalls encountered:
The data was not shuffled, so runs of consecutive samples had too much influence
After filling/replacing feature values, forgetting to write the result back into the DataFrame
With imbalanced labels, the evaluation metric cannot rely on AUC alone
imputer fills in missing values
LabelEncoder has no built-in way to handle NaN values
lgb data format is different from xgb
Different interface styles