16 Kaggle competition in practice: house price prediction

Read dataset

The training and test datasets both describe each house with features such as street type, year of construction, roof type and basement condition. The feature values include continuous numbers, discrete labels and even missing values ("na"). Only the training dataset includes the price of each house, that is, the label.
Now use pandas to read the two files.

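import d2lzh as d2l  # assumption: the d2l plotting helper used below comes from d2lzh
from mxnet import autograd, gluon, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import pandas as pd

# Load the training and test data
train_data = pd.read_csv('../data/kaggle_house_pred_train.csv')
test_data = pd.read_csv('../data/kaggle_house_pred_test.csv')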

View the first 4 features, the last 2 features and the label (SalePrice) of the first 4 samples.

# iloc selects rows and columns purely by integer position
train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]

We can see that the first feature is Id. It could help the model memorize each training sample, but it does not generalize to the test samples, so we do not use it for training. We then concatenate the 79 features of all training and test samples by sample.

# pd.concat concatenates pandas objects along an axis; the default (axis=0) stacks rows,
# so train_data without Id and label is stacked on top of test_data without Id
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
all_features.shape  # (2919, 79)
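
The slicing above can be tried on a miniature example. The sketch below uses two small hypothetical frames (the column subset and all values are made up for illustration) and only shows which columns survive the concatenation.

```python
import pandas as pd

# Hypothetical miniature train/test frames; the values are made up.
train = pd.DataFrame({'Id': [1, 2], 'LotArea': [8450, 9600],
                      'MSZoning': ['RL', 'RM'], 'SalePrice': [208500, 181500]})
test = pd.DataFrame({'Id': [3], 'LotArea': [11622], 'MSZoning': ['RH']})

# Drop Id (first column) from both frames and the label (last column) from train.
features = pd.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))

print(features.shape)          # rows = 2 train + 1 test, columns = 2 features
print(list(features.columns))  # Id and SalePrice are gone
```

Note that the row index of the result restarts for the test rows, so later code should split the frame by position (as the tutorial does with n_train) rather than by index label.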

Preprocessing data set

Standardize the continuous features: let the mean of a feature over the whole dataset be \(\mu\) and its standard deviation be \(\sigma\). We subtract \(\mu\) from each value of the feature and divide by \(\sigma\) to get the standardized value. Missing feature values are replaced with the mean of the feature.
In pandas, columns of dtype object hold text (str) or mixed values, so the columns whose dtype is not object are treated as numeric here.

# Select the columns that hold numeric features
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# For each numeric column, subtract the mean and divide by the standard deviation
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization every feature has mean 0, so missing values can be filled with 0
# DataFrame.fillna fills NA/NaN values with the given value
all_features[numeric_features] = all_features[numeric_features].fillna(0)
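
Why filling with 0 after standardization is the same as filling with the mean before it can be checked by hand. The following is a minimal sketch in plain Python on a made-up column; it assumes, as pandas does by default, that the mean and the (sample) standard deviation are computed over the observed values only.

```python
import math
import statistics

# Hypothetical feature column with one missing value (NaN).
column = [60.0, 80.0, float('nan'), 100.0]

observed = [v for v in column if not math.isnan(v)]
mu = statistics.mean(observed)      # mean over the observed values
sigma = statistics.stdev(observed)  # sample standard deviation

# Route 1: standardize first, then fill the missing entry with 0.
route1 = [0.0 if math.isnan(v) else (v - mu) / sigma for v in column]

# Route 2: fill the missing entry with the mean, then standardize
# with the same mu and sigma.
filled = [mu if math.isnan(v) else v for v in column]
route2 = [(v - mu) / sigma for v in filled]

print(route1)
print(route2)
```

Both routes give the same list because the mean maps to \((\mu - \mu)/\sigma = 0\).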

Convert discrete values to indicator features:
For example, suppose the feature MSZoning takes two different discrete values, RL and RM. This transformation removes the MSZoning feature and adds two new features, MSZoning_RL and MSZoning_RM, each with a value of 0 or 1. If the original MSZoning value of a sample is RL, then MSZoning_RL=1 and MSZoning_RM=0.

# pd.get_dummies converts categorical variables into dummy/indicator variables
# dummy_na=True treats a missing value (NaN) as a valid category and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
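
To see the effect of dummy_na on a small scale, here is a sketch with a hypothetical three-row frame in which the third house has no zoning recorded. The cast to int is only there so that the result prints as 0/1, since recent pandas versions return booleans.

```python
import pandas as pd

# Hypothetical toy column; the third house has a missing MSZoning value.
toy = pd.DataFrame({'MSZoning': ['RL', 'RM', None]})

# dummy_na=True adds an extra indicator column for missing values.
encoded = pd.get_dummies(toy, dummy_na=True).astype(int)

print(encoded)
```

Each row has exactly one indicator set to 1: the RL column for the first row, the RM column for the second, and the missing-value column for the third.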

Get the data in NumPy format through the values attribute and convert it into NDArray for later training.

# Number of training samples: 1460
n_train = train_data.shape[0]
# Training features
train_features = nd.array(all_features[:n_train].values)
# Test features
test_features = nd.array(all_features[n_train:].values)
# Reshape the training labels from shape (1460,) to (1460, 1)
train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))

Training model

A basic linear regression model and the squared loss function are used to train the model.

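# Squared loss function
loss = gloss.L2Loss()

def get_net():
    # Build a sequential container
    net = nn.Sequential()
    # Add the output layer: a single dense unit, i.e. linear regression
    net.add(nn.Dense(1))
    # Initialize the weights
    net.initialize()
    # Return the network
    return net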

Define the log root mean squared error used to evaluate the model. Given the predicted values \(\hat y_1, \ldots, \hat y_n\) and the corresponding true labels \(y_1, \ldots, y_n\), it is defined as

\[\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log(y_i)-\log(\hat y_i)\right)^2}. \]
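
As a quick numeric check of this formula, the sketch below evaluates it in plain Python for two made-up houses, each over-predicted by 10%. The helper name log_rmse_plain is hypothetical, and clipping predictions below 1 is an added safeguard so that the logarithm stays well behaved.

```python
import math

def log_rmse_plain(y_true, y_pred):
    """Root mean squared error between log(y_true) and log(y_pred)."""
    # Clip predictions below 1 to 1 so that the logarithm stays well behaved.
    clipped = [max(p, 1.0) for p in y_pred]
    squared = [(math.log(t) - math.log(p)) ** 2
               for t, p in zip(y_true, clipped)]
    return math.sqrt(sum(squared) / len(squared))

# Two hypothetical houses, each over-predicted by 10%.
cheap = log_rmse_plain([100_000.0], [110_000.0])
expensive = log_rmse_plain([1_000_000.0], [1_100_000.0])
print(cheap, expensive)
```

Both values equal \(\log(1.1)\): the metric measures relative error, so a 10% miss on a cheap house counts as much as a 10% miss on an expensive one.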

The log root mean squared error is implemented as follows:

def log_rmse(net, features, labels):
    # Clip predictions below 1 to 1 so that taking the logarithm is more stable
    clipped_preds = nd.clip(net(features), 1, float('inf'))
    # L2Loss returns half the squared error, hence the factor 2
    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())
    return rmse.asscalar()

def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    # Per-epoch training and test errors
    train_ls, test_ls = [], []
    # Load the data in shuffled mini-batches
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    # The Adam optimization algorithm is used here
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        # Iterate over mini-batches of features and labels
        for X, y in train_iter:
            with autograd.record():
                # Compute the loss
                l = loss(net(X), y)
            # Backpropagate and update the parameters
            l.backward()
            trainer.step(batch_size)
        # Record the training error
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            # Record the test error
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
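
To make the inner loop less of a black box, here is a framework-free sketch of mini-batch training for a one-feature linear model with the halved squared loss. It is an illustration under stated assumptions, not the tutorial's method: plain SGD is swapped in for Adam, there is no weight decay, and the data and hyperparameters are made up. In Gluon the two update lines correspond to calling backward on the loss and then the trainer's step with the batch size, which rescales the summed gradient by 1/batch_size.

```python
import random

# Hypothetical noise-free data generated from y = 2 * x + 1.
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]

w, b = 0.0, 0.0           # parameters of the linear model
lr, batch_size = 0.1, 4   # hypothetical hyperparameters
rng = random.Random(0)    # fixed seed so the run is repeatable

for epoch in range(500):
    order = list(range(len(xs)))
    rng.shuffle(order)    # plays the role of shuffle=True in a data loader
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        # Gradient of the batch sum of 0.5 * (w * x + b - y) ** 2.
        grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch)
        grad_b = sum((w * xs[i] + b - ys[i]) for i in batch)
        # Divide by the batch size, then take a plain SGD step.
        w -= lr * grad_w / len(batch)
        b -= lr * grad_b / len(batch)

print(round(w, 3), round(b, 3))
```

Because the data is noise-free, the parameters settle very close to the generating values 2 and 1.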

\(k\)-fold cross-validation

The following function returns the training and validation data required for the \(i\)-th fold of \(k\)-fold cross-validation.

def get_k_fold_data(k, i, X, y):
    # k must be greater than 1
    assert k > 1
    # Number of samples in each of the k folds
    fold_size = X.shape[0] // k
    # Initialize X_train, y_train
    X_train, y_train = None, None
    # Loop over the k folds
    for j in range(k):
        # slice(start, stop) builds a slice object for the j-th fold:
        # start = j * fold_size, stop = (j + 1) * fold_size
        idx = slice(j * fold_size, (j + 1) * fold_size)
        # Select the features and labels that belong to the j-th fold
        X_part, y_part = X[idx, :], y[idx]
        # The i-th fold is used as the validation set
        if j == i:
            X_valid, y_valid = X_part, y_part
        # The first of the remaining k-1 folds is assigned directly
        elif X_train is None:
            X_train, y_train = X_part, y_part
        # Later training folds are appended with concat
        else:
            X_train = nd.concat(X_train, X_part, dim=0)
            y_train = nd.concat(y_train, y_part, dim=0)
    # Return the training data (k-1 folds) and the validation data (fold i)
    return X_train, y_train, X_valid, y_valid
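
The fold arithmetic can be checked without any framework. The sketch below applies the same integer division and slice construction to a made-up list of 10 row indices with k = 3; the helper name fold_slices is hypothetical.

```python
def fold_slices(n, k):
    """Return the k contiguous slices used as validation folds for n rows."""
    assert k > 1
    fold_size = n // k  # integer division, as in get_k_fold_data
    return [slice(j * fold_size, (j + 1) * fold_size) for j in range(k)]

rows = list(range(10))  # 10 hypothetical sample indices
slices = fold_slices(len(rows), 3)

for i in range(3):
    valid = rows[slices[i]]
    # The training part is the concatenation of all other folds.
    train = [r for j, s in enumerate(slices) if j != i for r in rows[s]]
    print(f'fold {i}: valid={valid}, train={train}')

# Rows covered by at least one fold.
used = sorted({r for s in slices for r in rows[s]})
print(used)
```

Row 9 never appears in any fold: because of the integer division, the last n mod k rows are left out. With the 1460 training samples here and, for example, k = 5, each fold has 292 rows and nothing is dropped.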

In \(k\)-fold cross-validation, we train \(k\) times and return the average training and validation errors.

def k_fold(k, X_train, y_train, num_epochs,
           learning_rate, weight_decay, batch_size):
    # Running sums of the training and validation errors
    train_l_sum, valid_l_sum = 0, 0
    # Repeat k times, once per fold
    for i in range(k):
        # Get the data for the i-th fold
        data = get_k_fold_data(k, i, X_train, y_train)
        # Get a fresh network
        net = get_net()
        # Train and collect the per-epoch training and validation errors
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        # Accumulate the final-epoch training error
        train_l_sum += train_ls[-1]
        # Accumulate the final-epoch validation error
        valid_l_sum += valid_ls[-1]
        # Plot the error curves for the first fold only
        if i == 0:
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',
                         range(1, num_epochs + 1), valid_ls,
                         ['train', 'valid'])
        print('fold %d, train rmse %f, valid rmse %f'
              % (i, train_ls[-1], valid_ls[-1]))
    # Return the average training and validation errors
    return train_l_sum / k, valid_l_sum / k

Predict and submit results on Kaggle

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    # Get a fresh network
    net = get_net()
    # Train on the full training set; only the training errors are kept
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    # Plot the training curve
    d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    # Predict on the test set; preds has shape (1459, 1)
    preds = net(test_features).asnumpy()
    # Flatten preds to shape (1459,) and store it as a SalePrice column
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    # Put the test-set Id and the predictions side by side
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    # Save in CSV format for submission
    submission.to_csv('submission.csv', index=False)

# num_epochs, lr, weight_decay and batch_size must be defined before this call
train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)
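
The layout of the submission file can be previewed on made-up numbers. This sketch assumes predictions arrive as one value per row in a column of shape (n, 1); the Ids and prices are hypothetical, and the CSV is returned as a string instead of being written to disk.

```python
import pandas as pd

# Hypothetical test Ids and predictions of shape (3, 1); the values are made up.
test = pd.DataFrame({'Id': [1461, 1462, 1463]})
preds = [[120000.0], [155000.0], [180000.0]]

# Flatten the (3, 1) predictions to one value per row.
test['SalePrice'] = pd.Series([row[0] for row in preds])

# Keep only the two columns needed for the submission.
submission = pd.concat([test['Id'], test['SalePrice']], axis=1)

# Without a path, to_csv returns the CSV text.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

The first line of the output is the header Id,SalePrice, followed by one line per test sample.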

Posted by fazzfarrell on Tue, 19 Apr 2022 02:01:51 +0930