Hands-on Kaggle competition: house price prediction
Reading the dataset
Both datasets include features of each house, such as street type, year of construction, roof type, and basement condition. The feature values include continuous numbers, discrete labels, and even missing values ("NA"). Only the training set includes the price of each house, i.e., the label.
Now use pandas to read these two files.
# Imports used throughout this section
import d2lzh as d2l
from mxnet import autograd, gluon, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import pandas as pd

# Load the training data
train_data = pd.read_csv('../data/kaggle_house_pred_train.csv')
# Load the test data
test_data = pd.read_csv('../data/kaggle_house_pred_test.csv')
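As a quick sanity check, we can print the dimensions of both frames. The expected shapes below follow from the numbers used later in this section (1460 training samples, 1459 test samples, 79 shared features), assuming the standard Kaggle files:

# Expected: (1460, 81) -- Id + 79 features + SalePrice
print(train_data.shape)
# Expected: (1459, 80) -- Id + 79 features, no label
print(test_data.shape)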
Let us look at the first four features, the last two features, and the label (SalePrice) of the first four samples:
# iloc: purely integer-location based indexing, for selection by position
train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]
We can see that the first feature is Id. It helps the model identify each training sample, but it does not generalize to the test samples, so we do not use it for training. We concatenate all training and test samples over the 79 remaining features.
# pd.concat: concatenate pandas objects along a particular axis with optional
# set logic along the other axes. The default is a vertical (row-wise)
# concatenation: train_data without Id and label, followed by test_data
# without Id
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
all_features.shape  # (2919, 79)
Preprocessing the dataset
We standardize the continuous numerical features: let \(\mu\) be the mean of a feature over the whole dataset and \(\sigma\) its standard deviation. We then subtract \(\mu\) from each value of the feature and divide by \(\sigma\), i.e.,

\[x \leftarrow \frac{x - \mu}{\sigma},\]

to obtain each standardized feature value. Missing feature values are replaced with the feature's mean.
Columns whose pandas dtype is object hold text (str) or mixed types; the remaining columns are numeric.
# Select the columns whose features are numeric
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# For each numeric column, subtract the mean and divide by the standard
# deviation to obtain the standardized feature values
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization the mean of each feature is 0, so missing values can
# be replaced directly with 0
# pandas.DataFrame.fillna: fill NA/NaN values using the specified method
all_features[numeric_features] = all_features[numeric_features].fillna(0)
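A minimal sanity check (not part of the original pipeline) is to verify that no NaN remains and that each numeric column still has mean 0; filling the missing values with 0 does not change the mean, because the non-missing values already sum to 0 after standardization:

# No NaN should remain, and the column means should be ~0 up to rounding
assert all_features[numeric_features].isna().sum().sum() == 0
print(all_features[numeric_features].mean().abs().max())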
Convert discrete values to indicator features:
For example, suppose the feature MSZoning takes two distinct discrete values, RL and RM. This transformation removes the MSZoning feature and adds two new features, MSZoning_RL and MSZoning_RM, each with value 0 or 1. If a sample's original MSZoning value is RL, then MSZoning_RL = 1 and MSZoning_RM = 0.
# pd.get_dummies: convert categorical variables into dummy/indicator variables
# dummy_na=True treats a missing value as a legal feature value and creates an
# indicator feature for it, i.e., if a column contains NaN, NaN also counts as
# one of that column's values
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape
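To make the MSZoning example above concrete, here is a minimal standalone sketch of what pd.get_dummies does to a tiny hypothetical column (the values are invented for illustration):

import numpy as np
import pandas as pd

demo = pd.DataFrame({'MSZoning': ['RL', 'RM', np.nan]})
# Produces MSZoning_RL, MSZoning_RM, and MSZoning_nan indicator columns
print(pd.get_dummies(demo, dummy_na=True))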
Finally, we obtain the data in NumPy format through the values attribute and convert it into NDArray format for training.
# Number of training samples: 1460
n_train = train_data.shape[0]
# Training features
train_features = nd.array(all_features[:n_train].values)
# Test features
test_features = nd.array(all_features[n_train:].values)
# Reshape the training labels from shape (1460,) to shape (1460, 1)
train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))
Training the model
We use a basic linear regression model with the squared loss function to train the model.
# Squared loss function
loss = gloss.L2Loss()

def get_net():
    # Instantiate a Sequential container
    net = nn.Sequential()
    # Add the output layer
    net.add(nn.Dense(1))
    # Initialize the weights
    net.initialize()
    # Return the model instance
    return net
We define the log root-mean-square error used to evaluate the model. Given predicted values \(\hat y_1, \ldots, \hat y_n\) and the corresponding true labels \(y_1, \ldots, y_n\), it is defined as

\[\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(y_i) - \log(\hat y_i)\right)^2}.\]
The implementation of the log root-mean-square error is as follows:
def log_rmse(net, features, labels):
    # Clip predictions to be at least 1 so that taking the logarithm is
    # numerically stable
    clipped_preds = nd.clip(net(features), 1, float('inf'))
    # L2Loss computes half the squared error, so multiply by 2 before the
    # square root
    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())
    return rmse.asscalar()
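The factor of 2 compensates for the 1/2 that gloss.L2Loss builds into its definition, \(\frac{1}{2}(\hat y - y)^2\). A small sketch with synthetic values (for illustration only) makes this visible:

a = nd.array([1.0, 2.0])
b = nd.array([3.0, 5.0])
# L2Loss returns (a - b)^2 / 2 elementwise, so 2 * mean recovers the MSE
print(2 * gloss.L2Loss()(a, b).mean())  # mean of [4, 9] -> 6.5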
def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    # Training and test losses per epoch
    train_ls, test_ls = [], []
    # Load the data in mini-batches
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    # The Adam optimization algorithm is used here
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        # Take out the features and labels of each mini-batch
        for X, y in train_iter:
            with autograd.record():
                # Compute the loss
                l = loss(net(X), y)
            # Backpropagate
            l.backward()
            trainer.step(batch_size)
        # Record the training error
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            # Record the test error
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
\(k\)-fold cross-validation
The following function returns the training and validation data needed for the \(i\)th fold of \(k\)-fold cross-validation.
def get_k_fold_data(k, i, X, y):
    # k must be greater than 1
    assert k > 1
    # Number of samples in each of the k folds
    fold_size = X.shape[0] // k
    # Initialize X_train, y_train
    X_train, y_train = None, None
    # Loop over the k folds
    for j in range(k):
        # slice(start, stop[, step]) returns a slice object; idx selects the
        # samples of the j-th fold, from j * fold_size to (j + 1) * fold_size
        idx = slice(j * fold_size, (j + 1) * fold_size)
        # Features and labels belonging to the j-th fold
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            # The i-th fold serves as the validation set
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            # First of the remaining k-1 folds: assign directly
            X_train, y_train = X_part, y_part
        else:
            # Subsequent folds: concatenate
            X_train = nd.concat(X_train, X_part, dim=0)
            y_train = nd.concat(y_train, y_part, dim=0)
    # Return the k-1 folds for training and the i-th fold for validation
    return X_train, y_train, X_valid, y_valid
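A quick way to convince yourself that the folds are split correctly is to run the function on small synthetic data (the shapes below are for this made-up example only):

X_demo = nd.arange(20).reshape((10, 2))
y_demo = nd.arange(10)
X_tr, y_tr, X_va, y_va = get_k_fold_data(5, 0, X_demo, y_demo)
# With k=5 and 10 samples: 8 training rows and 2 validation rows
print(X_tr.shape, X_va.shape)  # (8, 2) (2, 2)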
In \(k\)-fold cross-validation, we train \(k\) times and return the average training and validation errors.
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    # Running sums of the training and validation losses
    train_l_sum, valid_l_sum = 0, 0
    # Loop over the k folds
    for i in range(k):
        # Get the data for the i-th fold
        data = get_k_fold_data(k, i, X_train, y_train)
        # Get a fresh net instance
        net = get_net()
        # Compute the training and validation losses
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        # Accumulate the training loss
        train_l_sum += train_ls[-1]
        # Accumulate the validation loss
        valid_l_sum += valid_ls[-1]
        # Plot the curves of the first fold
        if i == 0:
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',
                         range(1, num_epochs + 1), valid_ls,
                         ['train', 'valid'])
        print('fold %d, train rmse %f, valid rmse %f'
              % (i, train_ls[-1], valid_ls[-1]))
    # Return the average training loss and the average validation loss
    return train_l_sum / k, valid_l_sum / k
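We can now run cross-validation. The hyperparameter values below are one plausible starting point, not tuned results; adjust them based on the validation error. These same names (num_epochs, lr, weight_decay, batch_size) are reused by the final training call at the end of this section:

k, num_epochs, lr, weight_decay, batch_size = 5, 100, 5, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print('%d-fold validation: avg train rmse %f, avg valid rmse %f'
      % (k, train_l, valid_l))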
Predicting and submitting results on Kaggle
def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    # Get a net instance
    net = get_net()
    # Train on the full training set; there is no validation set here
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    # Plot the training curve
    d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    # Compute the predicted labels; preds has shape (1459, 1)
    preds = net(test_features).asnumpy()
    # Add a SalePrice column; preds.reshape(1, -1) has shape (1, 1459)
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    # Concatenate the test-set Id column and the predictions
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    # Write the submission in csv format
    submission.to_csv('submission.csv', index=False)
train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size)