A record of using LIBSVM in VS2015
LIBSVM is a simple, easy-to-use software package for SVM pattern recognition and regression, developed by Chih-Jen Lin and others at National Taiwan University. The package draws on convergence-proof results to improve the algorithm and achieves good results. Below is a record of using LIBSVM (version 3.25) for classification in VS2015.
1. Usage workflow
(1). Prepare the dataset in the format required by LIBSVM (you can also define your own data format, in which case you must write the data-loading function yourself; this record uses the official LIBSVM data format as the example);
(2). Perform a simple scaling operation on the data;
(3). Consider using the RBF (radial basis function) kernel;
(4). If the RBF kernel is selected, obtain the optimal parameters C and gamma through cross validation;
(5). Train on the whole training set with the optimal C and gamma to obtain the support vector machine model;
(6). Test and predict with the obtained model (a minimal end-to-end sketch follows this list).
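Before going into the details, here is a minimal sketch of how the six steps map onto the wrapper class used in this record. The train helper and the file names are assumptions for illustration; only readTxt2, svmScale and predict are shown later in the record.

#include <string>

int main()
{
    ClassificationSVM svm;             // wrapper class used throughout this record

    // (1)-(2): load the training data in LIBSVM format and scale it to [-1, 1]
    svm.readTxt2("vowel_train.txt");   // file name is illustrative
    svm.svmScale(true);                // true: compute and save the scale file

    // (3)-(5): RBF kernel, cross-validated C/gamma, then train and save the model
    svm.train("svm_model.txt");        // assumed helper wrapping svm_train/svm_save_model

    // (6): predict on a test file with the saved model
    svm.predict("vowel_test.txt", "svm_model.txt");
    return 0;
}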
2. Introduction to data format
Download the file package from the LIBSVM official website. We mainly use the svm.h and svm.cpp files, which are added directly to the project. I encapsulated the relevant functions into a ClassificationSVM class.
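The record never shows the class declaration itself. A minimal sketch consistent with the member names used in the code below (dataVec, labels, featureDim, sampleNum, prob and param are all inferred from usage, not the author's original header):

#include <string>
#include <vector>
#include "svm.h"   // from the official LIBSVM package

class ClassificationSVM
{
public:
    void readTxt2(const std::string& featureFileName);  // load LIBSVM-format data
    void svmScale(bool train_model);                    // scale features to [-1, 1]
    void setParam(double c, double g);                  // assumed helper, see section 4
    void train(const std::string& modelFileName);       // assumed helper wrapping svm_train, see section 5
    void predict(const std::string& featureFileName,
                 const std::string& modelFileName);     // shown in section 6

private:
    std::vector<std::vector<double>> dataVec;  // one feature vector per sample
    std::vector<double> labels;                // class label of each sample (stored as double)
    int featureDim = -1;                       // number of features per sample
    int sampleNum = 0;                         // number of samples
    svm_problem prob;                          // training data in LIBSVM form
    svm_parameter param;                       // training parameters
};
// the .cpp additionally needs <fstream>, <sstream> and <cstdlib>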
In LIBSVM, the type that holds the loaded feature data is svm_problem, which records the imported data during training and prediction. The struct has three members, as follows:
struct svm_problem
{
    int l;                // total number of samples
    double *y;            // class label of each sample
    struct svm_node **x;  // features of all samples: a 2-D array, one sample's features per row
};
where the svm_node type is defined as follows:
struct svm_node   // stores a single feature of a sample in the input space
{
    int index;    // dimension index of the feature in the feature space
    double value; // value of the feature
};
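For illustration (the numbers are made up), a sample line "2 1:0.5 3:-1.2" would be stored as the node array below. The index -1 terminator is how LIBSVM marks the end of a sample, and the label 2 goes into svm_problem::y rather than into the nodes:

svm_node sample[3];
sample[0].index = 1;  sample[0].value = 0.5;   // feature 1
sample[1].index = 3;  sample[1].value = -1.2;  // feature 3 (feature 2 omitted: sparse format)
sample[2].index = -1;                          // terminator, no value needed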
In this record I used the official vowel dataset: it contains 11 classes with 48 samples per class, 528 samples in total, all used for model training. The official data format is:
<class label> 1:<feature 1> 2:<feature 2> 3:<feature 3> ...
The function that reads the data from the txt file into arrays is as follows:
// Read data in the official LIBSVM format from a text file
void ClassificationSVM::readTxt2(const std::string& featureFileName)
{
    dataVec.clear();   // dataVec: 2-D array, becomes the x of svm_problem
    labels.clear();    // labels: class label of each sample
    featureDim = -1;   // number of features per sample
    sampleNum = 0;     // number of samples

    std::ifstream fin;
    std::string rowData;      // one line of the file
    std::istringstream iss;
    fin.open(featureFileName);

    std::string dataVal;
    while (std::getline(fin, rowData))
    {
        iss.clear();
        iss.str(rowData);
        bool first = true;
        std::vector<double> rowDataVec;
        // read the line token by token
        while (iss >> dataVal)
        {
            if (first)   // the first token is the class label
            {
                first = false;
                labels.push_back(atof(dataVal.c_str()));
                sampleNum++;
            }
            else         // the remaining tokens are index:value pairs
            {
                // split the token at the colon and keep the value part
                for (size_t k = 0; k < dataVal.size(); k++)
                {
                    if (dataVal[k] == ':')
                    {
                        dataVal = dataVal.substr(k + 1);
                        break;
                    }
                }
                rowDataVec.push_back(atof(dataVal.c_str()));
            }
        }
        dataVec.push_back(rowDataVec);
    }
    featureDim = dataVec[0].size();
}
3. Data scaling
The range of the raw input data may be too large or too small; rescaling it to a suitable range makes training and prediction faster. Data is usually scaled to [0, 1] or [-1, 1]; here I scale to [-1, 1]. The scaling formula is as follows:
y^{\prime} = \text{lower} + (\text{upper} - \text{lower}) \times \frac{y - \min}{\max - \min}

where min and max are the minimum and maximum of the feature and [lower, upper] is the target range. For example, with min = 0, max = 10 and target range [-1, 1], y = 2.5 maps to y' = -1 + 2 × 0.25 = -0.5.
// Normalize to [-1, 1]. During training the per-feature min/max are computed
// and written to a scale file; during prediction the same file is read back.
void ClassificationSVM::svmScale(bool train_model)
{
    double *minVals = new double[featureDim];
    double *maxVals = new double[featureDim];
    if (train_model)
    {
        for (int i = 0; i < featureDim; i++)
        {
            minVals[i] = dataVec[0][i];
            maxVals[i] = dataVec[0][i];
        }
        for (int i = 0; i < dataVec.size(); i++)
        {
            for (int j = 0; j < dataVec[i].size(); j++)
            {
                if (dataVec[i][j] < minVals[j]) minVals[j] = dataVec[i][j];
                if (dataVec[i][j] > maxVals[j]) maxVals[j] = dataVec[i][j];
            }
        }
        // the scale file stores the minimum and maximum of every feature
        std::ofstream out("scale_params.txt");
        for (int i = 0; i < featureDim; i++)
            out << minVals[i] << " ";
        out << std::endl;
        for (int i = 0; i < featureDim; i++)
            out << maxVals[i] << " ";
    }
    else
    {
        std::ifstream fin;
        std::string rowData;   // one line of the scale file
        std::istringstream iss;
        fin.open("scale_params.txt");

        std::getline(fin, rowData);   // first line: minima
        iss.clear();
        iss.str(rowData);
        double dataVal;
        int count = 0;
        while (iss >> dataVal)
        {
            minVals[count] = dataVal;
            count++;
        }

        count = 0;
        std::getline(fin, rowData);   // second line: maxima
        iss.clear();
        iss.str(rowData);
        while (iss >> dataVal)
        {
            maxVals[count] = dataVal;
            count++;
        }
    }
    for (int i = 0; i < dataVec.size(); i++)
        for (int j = 0; j < dataVec[i].size(); j++)
            dataVec[i][j] = -1 + 2 * (dataVec[i][j] - minVals[j]) / (maxVals[j] - minVals[j]);
    delete[] minVals;   // arrays allocated with new[] must be released with delete[]
    delete[] maxVals;
}
After scaling, we can pack the loaded dataVec and labels into the official svm_problem structure.
// Fill in prob, which is declared as: svm_problem prob;
prob.l = sampleNum;                   // number of training samples
prob.x = new svm_node*[sampleNum];    // feature matrix
prob.y = new double[sampleNum];       // label array
for (int i = 0; i < sampleNum; ++i)
{
    prob.x[i] = new svm_node[featureDim + 1];   // +1 for the terminator node
    for (int j = 0; j < featureDim; ++j)
    {
        prob.x[i][j].index = j + 1;   // LIBSVM feature indices start at 1
        prob.x[i][j].value = dataVec[i][j];
    }
    prob.x[i][featureDim].index = -1; // terminator: marks the end of the sample
    prob.y[i] = labels[i];
}
4. Cross validation
Cross validation is a statistical method for assessing classifier performance. The basic idea is to partition the original dataset into groups, using one part as a training set and the rest as a validation set. The classifier is first trained on the training set, and the trained model is then evaluated on the validation set; the validation result serves as the performance indicator of the classifier.
Here we mainly use K-fold cross validation (5-fold is a common choice) to find the most reasonable C and gamma for the model (the model's parameter struct is shown below); grid.py in the official tools folder solves for the optimized parameters in the same way. The parameter ranges are -5 <= log2(C) <= 15 and -15 <= log2(gamma) <= 3, with a step of 2. In C++ we call the svm_cross_validation function from svm.cpp while traversing the parameter grid.
struct svm_parameter
{
    int svm_type;
    int kernel_type;
    int degree;          /* for poly */
    double gamma;        /* for poly/rbf/sigmoid */
    double coef0;        /* for poly/sigmoid */

    /* these are for training only */
    double cache_size;   /* in MB */
    double eps;          /* stopping criteria */
    double C;            /* for C_SVC, EPSILON_SVR and NU_SVR */
    int nr_weight;       /* for C_SVC */
    int *weight_label;   /* for C_SVC */
    double *weight;      /* for C_SVC */
    double nu;           /* for NU_SVC, ONE_CLASS, and NU_SVR */
    double p;            /* for EPSILON_SVR */
    int shrinking;       /* use the shrinking heuristics */
    int probability;     /* do probability estimates */
};
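The cross-validation loop below calls a setParam helper that the record does not show. A plausible sketch, assuming param is the svm_parameter member of the class and using typical C-SVC/RBF defaults (the exact values the author used are not given):

// Hypothetical helper: configure a C-SVC with an RBF kernel.
// The constants C_SVC and RBF are defined in svm.h.
void ClassificationSVM::setParam(double c, double g)
{
    param.svm_type = C_SVC;
    param.kernel_type = RBF;
    param.degree = 3;        // unused by RBF, kept at the LIBSVM default
    param.gamma = g;         // RBF kernel width, from the grid search
    param.coef0 = 0;
    param.cache_size = 100;  // kernel cache in MB
    param.eps = 1e-3;        // stopping tolerance
    param.C = c;             // penalty parameter, from the grid search
    param.nr_weight = 0;
    param.weight_label = NULL;
    param.weight = NULL;
    param.nu = 0.5;
    param.p = 0.1;
    param.shrinking = 1;
    param.probability = 0;   // set to 1 if svm_predict_probability is needed
}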
// Cross-validated grid search for the optimal C and gamma
double *target = new double[prob.l];   // predicted label of each sample
int logG, logC;
double bestG = 1, bestC = 1;           // best parameter values found so far
int minCount = prob.l;                 // smallest error count so far
std::vector<double> rates;             // accuracy of every (C, gamma) pair
for (logC = -5; logC <= 15; logC += 2)
{
    for (logG = -15; logG <= 3; logG += 2)
    {
        double c = pow(2, logC);
        double g = pow(2, logG);
        setParam(c, g);                               // update the model parameters
        svm_cross_validation(&prob, &param, 5, target);   // 5-fold cross validation
        int count = 0;
        for (int i = 0; i < prob.l; i++)
        {
            if (target[i] != labels[i])   // count misclassified samples
                count++;
        }
        if (count < minCount)
        {
            minCount = count;
            bestC = c;
            bestG = g;
        }
        rates.push_back(1.0 * (prob.l - count) / prob.l * 100);
    }
}
// write every parameter pair and its accuracy to a file
std::ofstream out("rates.txt");
int count1 = 0;
for (logC = -5; logC <= 15; logC += 2)
{
    for (logG = -15; logG <= 3; logG += 2)
    {
        out << "log2c=" << logC << " "
            << "log2g=" << logG << " "
            << "rate=" << rates[count1] << std::endl;
        count1++;
    }
}
delete[] target;
5. Model training
Model training mainly calls the official svm_train function.
std::cout << "start training" << std::endl; svm_model *svmModel = svm_train(&prob, ¶m); std::cout << "save model" << std::endl; svm_save_model(modelFileName.c_str(), svmModel); std::cout << "done!" << std::endl;
After training completes, the model file is written to the specified path modelFileName.
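One point the record leaves out: once training (and any prediction with this model) is finished, the allocated memory can be released. A sketch using the real LIBSVM calls svm_free_and_destroy_model and svm_destroy_param; note that the model returned by svm_train still references the nodes in prob, so prob should only be freed after the model is saved and destroyed:

svm_free_and_destroy_model(&svmModel);   // frees the model and sets the pointer to NULL
svm_destroy_param(&param);               // frees weight arrays inside param, if any

// release the training data we allocated ourselves for prob
for (int i = 0; i < prob.l; ++i)
    delete[] prob.x[i];
delete[] prob.x;
delete[] prob.y;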
6. Model test (prediction)
The model is tested with the svm_predict function (or svm_predict_probability). The process is similar to the above: after the file is imported and scaled, the loaded model is used directly for prediction, without cross validation, to obtain the predicted result. The return type is double because the same function serves both classification and regression; for classification it actually returns one of the class labels (integer values) that we supplied in labels.
void ClassificationSVM::predict(const std::string& featureFileName, const std::string& modelFileName)
{
    // load the saved model, then read and scale the features in the feature file
    svm_model *model = svm_load_model(modelFileName.c_str());
    readTxt2(featureFileName);
    svmScale(false);   // false: reuse the scale file written during training

    int count = 0;     // number of correct predictions
    for (int i = 0; i < dataVec.size(); i++)
    {
        // build the svm_node array for one sample
        svm_node *sample = new svm_node[featureDim + 1];
        for (int j = 0; j < featureDim; ++j)
        {
            sample[j].index = j + 1;
            sample[j].value = dataVec[i][j];
        }
        sample[featureDim].index = -1;   // terminator

        //double *probresult = new double[11];
        //double resultLabel = svm_predict_probability(model, sample, probresult);
        double resultLabel = svm_predict(model, sample);
        if (std::fabs(resultLabel - labels[i]) < 1e-5)   // compare with the true label
            count++;
        delete[] sample;
    }
    double accuracy = 1.0 * count / dataVec.size();   // fraction predicted correctly
}