# 16 Hands-on Kaggle competition: house price prediction

### Hands-on Kaggle competition: house price prediction

The competition data consists of a training set and a test set. Both include the features of each house, such as street type, year of construction, roof type and basement condition. The feature values can be continuous numbers, discrete labels, or missing values marked "na". Only the training set includes the sale price of each house, that is, the label.
Now use pandas to read these two files.

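# Imports used by the code in this post
# (d2l is the book's helper package, assumed here to be installed as d2lzh)
from mxnet import autograd, gluon, nd
from mxnet.gluon import data as gdata, loss as gloss, nn
import pandas as pd
import d2lzh as d2l

# Training data loading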
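The loading code itself is not shown here. As a minimal sketch, assuming the two CSV files follow the layout described in this post (an `Id` column first and, in the training file only, `SalePrice` last), `pd.read_csv` accepts either a file path or a file-like object. The in-memory CSV text and the helper name `load_house_data` are illustrative stand-ins, not the real competition files.

```python
import io
import pandas as pd

def load_house_data(train_source, test_source):
    """Read the training and test CSVs (file paths or file-like objects)."""
    return pd.read_csv(train_source), pd.read_csv(test_source)

# Tiny made-up CSVs standing in for the real files
train_csv = io.StringIO(
    "Id,MSZoning,LotFrontage,SalePrice\n"
    "1,RL,65,208500\n"
    "2,RM,NA,181500\n"
)
test_csv = io.StringIO(
    "Id,MSZoning,LotFrontage\n"
    "1461,RL,80\n"
)
train_data, test_data = load_house_data(train_csv, test_csv)
print(train_data.shape, test_data.shape)
# The literal string "NA" is parsed as a missing value (NaN) by default
print(train_data["LotFrontage"].isna().tolist())
```

With the real data, the two arguments would simply be the paths of the downloaded training and test files.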


View the first 4 features, the last 2 features and the label (SalePrice) of the first 4 samples.

# iloc selects rows and columns purely by integer position
train_data.iloc[0:4, [0, 1, 2, 3, -3, -2, -1]]
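To make the positional selection concrete, here is a small sketch on a made-up frame (the column names mimic the data set, the values are invented). Negative positions count from the last column, and the stop index of a row slice is excluded.

```python
import pandas as pd

# Made-up frame: Id first, label (SalePrice) last
df = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSSubClass": [60, 20, 60],
    "MSZoning": ["RL", "RL", "RM"],
    "SaleCondition": ["Normal", "Normal", "Abnorml"],
    "SalePrice": [208500, 181500, 223500],
})

# Rows 0 and 1 (stop index 2 is excluded); first two and last two columns
subset = df.iloc[0:2, [0, 1, -2, -1]]
print(subset)
```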


The first feature is Id. It can help the model memorize each training sample, but it does not generalize to the test samples, so we do not use it for training. We concatenate the 79 features of all training and test samples along the sample axis.

# pd.concat: concatenate pandas objects along a particular axis with optional set logic along the other axes
# The default is a vertical (row-wise) concatenation: train_data without Id and label, followed by test_data without Id
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
# all_features.shape is (2919, 79)
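The effect of the two slices can be checked on a toy example. This is a sketch with invented values, assuming only the column layout described in this post (Id first, SalePrice last in the training frame).

```python
import pandas as pd

train_data = pd.DataFrame({"Id": [1, 2], "LotArea": [8450, 9600],
                           "SalePrice": [208500, 181500]})
test_data = pd.DataFrame({"Id": [3], "LotArea": [11622]})

# Drop Id (first column) from both frames and SalePrice (last column) from the training frame
all_features = pd.concat((train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]))
print(all_features.shape)
print(all_features.index.tolist())
```

Note that `pd.concat` keeps the original row labels, so the index contains repeats. Positional slicing such as `all_features[:n_train]` is unaffected by this.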


#### Preprocessing data set

Standardize the continuous features: let the mean of a feature over the whole data set be \(\mu\) and its standard deviation be \(\sigma\). We subtract \(\mu\) from each value of the feature and divide by \(\sigma\) to get the standardized value. Missing feature values are replaced with the mean of the feature.
In pandas, columns with dtype object hold text (str) or mixed values, so the numeric features are the columns whose dtype is not object.

# Select the columns whose features are numeric
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
# For each numeric column, subtract $\mu$ from each value and divide by $\sigma$ to get the standardized value
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))
# After standardization the mean of each feature is 0, so missing values can be replaced with 0 directly
# pandas.DataFrame.fillna: fill NA/NaN values using the specified method
all_features[numeric_features] = all_features[numeric_features].fillna(0)
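A short worked example shows why filling with 0 after standardization is the same as mean imputation. The column name and values are made up for illustration.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({"LotFrontage": [1.0, 2.0, 3.0, np.nan]})

# mean() and std() skip NaN; std() is the sample standard deviation (ddof=1)
# Here mean = 2 and std = 1, so the known values become -1, 0, 1
standardized = features.apply(lambda x: (x - x.mean()) / x.std())

# The missing entry is set to 0, which is the mean of the standardized column
filled = standardized.fillna(0)
print(filled["LotFrontage"].tolist())
```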


Convert discrete values to indicator features:
For example, suppose the feature MSZoning has two distinct discrete values, RL and RM. This transformation removes the MSZoning feature and adds two new features, MSZoning_RL and MSZoning_RM, each with a value of 0 or 1. If the original MSZoning value of a sample is RL, then MSZoning_RL=1 and MSZoning_RM=0.

# pd.get_dummies: convert categorical variables into dummy/indicator variables
# dummy_na=True treats a missing value (NaN) as a valid category and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)
all_features.shape
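The MSZoning example can be reproduced on a toy frame. This is a sketch with invented values. The `dtype=float` argument is an addition that keeps the indicators numeric, since recent pandas versions return booleans by default.

```python
import numpy as np
import pandas as pd

features = pd.DataFrame({
    "LotArea": [0.5, -0.5, 0.0],
    "MSZoning": ["RL", "RM", np.nan],
})

# Numeric columns are left alone; the text column is replaced by indicator columns
encoded = pd.get_dummies(features, dummy_na=True, dtype=float)
print(encoded.columns.tolist())
print(encoded.loc[0, "MSZoning_RL"], encoded.loc[0, "MSZoning_RM"])
```

The sample whose MSZoning is missing gets a 1 in the extra `MSZoning_nan` column and 0 in the other two.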


Get the data in NumPy format through the values attribute and convert it into NDArray for later training.

# Number of training samples: 1460
n_train = train_data.shape[0]
# Training features
train_features = nd.array(all_features[:n_train].values)
# Test features
test_features = nd.array(all_features[n_train:].values)
# Reshape the training labels from shape (1460,) to shape (1460, 1)
train_labels = nd.array(train_data.SalePrice.values).reshape((-1, 1))
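The split and reshape can be sketched with plain NumPy arrays standing in for NDArray. The data below is invented, and the `astype("float32")` cast is an addition that may help when the frame contains boolean indicator columns.

```python
import numpy as np
import pandas as pd

all_features = pd.DataFrame({"LotArea": [0.5, -0.5, 0.0],
                             "MSZoning_RL": [1.0, 0.0, 1.0]})
sale_price = pd.Series([208500.0, 181500.0])  # labels exist only for the training rows

n_train = len(sale_price)
# The first n_train rows are training samples, the rest are test samples
train_features = all_features[:n_train].values.astype("float32")
test_features = all_features[n_train:].values.astype("float32")
# reshape((-1, 1)) turns shape (n,) into a column of shape (n, 1)
train_labels = sale_price.values.astype("float32").reshape((-1, 1))
print(train_features.shape, test_features.shape, train_labels.shape)
```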


#### Training model

We train a basic linear regression model with the squared loss function.

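# Squared loss function
loss = gloss.L2Loss()

def get_net():
    # Create a sequential container
    net = nn.Sequential()
    # Linear regression: a single fully connected layer with one output
    net.add(nn.Dense(1))
    # Initialize the weights
    net.initialize()
    # Return the network
    return net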
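For readers without MXNet at hand, the model and loss can be sketched in NumPy. The function names `dense_forward` and `l2_loss` are illustrative. The factor 0.5 follows the half squared error convention of Gluon's `L2Loss`.

```python
import numpy as np

def dense_forward(X, w, b):
    """Linear model with one output per sample: X @ w + b."""
    return X @ w + b

def l2_loss(pred, label):
    """Per-sample half squared error."""
    return 0.5 * (pred - label) ** 2

X = np.array([[1.0, 2.0], [3.0, 4.0]])
w = np.array([[0.5], [-1.0]])
b = 1.0
pred = dense_forward(X, w, b)
labels = np.zeros((2, 1))
print(pred.ravel())
print(l2_loss(pred, labels).ravel())
```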


Define the log root mean squared error used to evaluate the model. Given the predicted values \(\hat y_1, \ldots, \hat y_n\) and the corresponding true labels \(y_1, \ldots, y_n\), it is defined as

$\sqrt{\frac{1}{n}\sum_{i=1}^n\left(\log(y_i)-\log(\hat y_i)\right)^2}.$

The log root mean squared error is implemented as follows:

def log_rmse(net, features, labels):
    # Clip predictions below 1 to 1 so that taking the logarithm is numerically stable
    clipped_preds = nd.clip(net(features), 1, float('inf'))
    # L2Loss returns half the squared error, hence the factor of 2
    rmse = nd.sqrt(2 * loss(clipped_preds.log(), labels.log()).mean())
    return rmse.asscalar()
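The same metric can be written directly from the formula in NumPy. This is a sketch for checking intuition, not a replacement for the Gluon version, and the name `log_rmse_np` is made up. Because the error is measured on logarithms, it depends on the ratio between prediction and label rather than on their absolute difference.

```python
import numpy as np

def log_rmse_np(preds, labels):
    """Root mean squared error between log(preds) and log(labels)."""
    clipped = np.clip(preds, 1, None)  # avoid taking the log of values below 1
    return float(np.sqrt(np.mean((np.log(clipped) - np.log(labels)) ** 2)))

labels = np.array([100000.0, 200000.0])
exact = log_rmse_np(labels, labels)          # perfect predictions give 0.0
scaled = log_rmse_np(labels * np.e, labels)  # every prediction off by a factor e gives about 1
print(exact, scaled)
```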

def train(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    # Per-epoch training and test errors
    train_ls, test_ls = [], []
    # Mini-batch iterator over the training data
    train_iter = gdata.DataLoader(gdata.ArrayDataset(
        train_features, train_labels), batch_size, shuffle=True)
    # The Adam optimization algorithm is used here
    trainer = gluon.Trainer(net.collect_params(), 'adam', {
        'learning_rate': learning_rate, 'wd': weight_decay})
    # Iterate over epochs
    for epoch in range(num_epochs):
        # Take out the features and labels of each mini-batch
        for X, y in train_iter:
            # Compute the loss while recording the computation graph
            with autograd.record():
                l = loss(net(X), y)
            # Backpropagate to compute the gradients
            l.backward()
            # Update the parameters
            trainer.step(batch_size)
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))
    return train_ls, test_ls
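The structure of the loop (shuffle, iterate over mini-batches, compute gradients, update, record the error once per epoch) can be sketched in NumPy. Note the substitution: this sketch uses plain mini-batch SGD with an explicit weight decay term rather than Adam, and the data is synthetic. The name `train_linear` is illustrative.

```python
import numpy as np

def train_linear(X, y, num_epochs, learning_rate, weight_decay, batch_size, seed=0):
    """Mini-batch SGD for linear regression with half squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros((d, 1)), 0.0
    history = []
    for epoch in range(num_epochs):
        order = rng.permutation(n)  # plays the role of shuffle=True
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]
            # Gradient of the batch-mean loss, plus weight decay on w only
            grad_w = X[idx].T @ err / len(idx) + weight_decay * w
            grad_b = err.mean()
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
        # Record the loss on the full data once per epoch
        history.append(float(np.mean(0.5 * (X @ w + b - y) ** 2)))
    return w, b, history

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([[2.0], [-1.0], [0.5]])
y = X @ true_w + 4.0  # noise-free synthetic labels
w, b, history = train_linear(X, y, num_epochs=50, learning_rate=0.1,
                             weight_decay=0.0, batch_size=20)
print(history[0], history[-1])
```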


#### $k$-fold cross-validation

The following function returns the training and validation data needed for the i-th fold of k-fold cross-validation.

def get_k_fold_data(k, i, X, y):
    # k must be greater than 1
    assert k > 1
    # The data is divided into k subsets; this is the number of samples in each
    fold_size = X.shape[0] // k
    # Initialize X_train, y_train
    X_train, y_train = None, None
    # Loop over the k subsets
    for j in range(k):
        # slice(start, stop[, step]) returns a slice object
        # idx selects the j-th subset: start = j * fold_size, stop = (j + 1) * fold_size
        idx = slice(j * fold_size, (j + 1) * fold_size)
        # Select the features and labels of the j-th subset
        X_part, y_part = X[idx, :], y[idx]
        # The i-th subset is used as the validation set
        if j == i:
            X_valid, y_valid = X_part, y_part
        # The first of the remaining k-1 subsets is assigned directly
        elif X_train is None:
            X_train, y_train = X_part, y_part
        # Later subsets are appended with concat
        else:
            X_train = nd.concat(X_train, X_part, dim=0)
            y_train = nd.concat(y_train, y_part, dim=0)
    # Return the training data (k-1 subsets) and the validation data (the i-th subset)
    return X_train, y_train, X_valid, y_valid
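The same split can be sketched in NumPy to see which samples end up where. The name `k_fold_split` is illustrative. One detail worth noticing: because `fold_size` uses integer division, any trailing samples beyond `k * fold_size` are not used at all.

```python
import numpy as np

def k_fold_split(k, i, X, y):
    """Return (X_train, y_train, X_valid, y_valid) for fold i of k."""
    assert k > 1
    fold_size = X.shape[0] // k
    valid = slice(i * fold_size, (i + 1) * fold_size)
    # All folds except the i-th one make up the training data
    train_parts = [slice(j * fold_size, (j + 1) * fold_size)
                   for j in range(k) if j != i]
    X_train = np.concatenate([X[s] for s in train_parts], axis=0)
    y_train = np.concatenate([y[s] for s in train_parts], axis=0)
    return X_train, y_train, X[valid], y[valid]

X = np.arange(22).reshape(11, 2)  # 11 samples, 2 features
y = np.arange(11).reshape(11, 1)  # label equals the sample number
X_train, y_train, X_valid, y_valid = k_fold_split(5, 1, X, y)
print(y_valid.ravel())  # samples 2 and 3 form fold 1
print(y_train.ravel())  # sample 10 is left out because 11 // 5 == 2
```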


In \(k\)-fold cross-validation, we train \(k\) times and return the average training and validation errors.

def k_fold(k, X_train, y_train, num_epochs,
           learning_rate, weight_decay, batch_size):
    # Initialize the sums of the training and validation errors
    train_l_sum, valid_l_sum = 0, 0
    # Loop k times
    for i in range(k):
        # Get the data for the i-th fold
        data = get_k_fold_data(k, i, X_train, y_train)
        # Get a fresh network
        net = get_net()
        # Train and record the training and validation errors per epoch
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        # Accumulate the final training error
        train_l_sum += train_ls[-1]
        # Accumulate the final validation error
        valid_l_sum += valid_ls[-1]
        # Plot the error curves for the first fold only
        if i == 0:
            d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse',
                         range(1, num_epochs + 1), valid_ls,
                         ['train', 'valid'])
        print('fold %d, train rmse %f, valid rmse %f'
              % (i, train_ls[-1], valid_ls[-1]))
    # Return the average training error and the average validation error
    return train_l_sum / k, valid_l_sum / k
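The averaging logic can be tried without MXNet. In this sketch a closed-form least squares fit stands in for `train`, and plain RMSE stands in for the log RMSE. The data is synthetic and the name `k_fold_lstsq` is illustrative.

```python
import numpy as np

def rmse(w, X, y):
    return float(np.sqrt(np.mean((X @ w - y) ** 2)))

def k_fold_lstsq(k, X, y):
    """Average training and validation RMSE over k folds."""
    fold_size = X.shape[0] // k
    train_sum, valid_sum = 0.0, 0.0
    for i in range(k):
        valid_idx = np.arange(i * fold_size, (i + 1) * fold_size)
        train_idx = np.setdiff1d(np.arange(k * fold_size), valid_idx)
        # Stand-in for train(): least squares fit on the k-1 training folds
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        train_sum += rmse(w, X[train_idx], y[train_idx])
        valid_sum += rmse(w, X[valid_idx], y[valid_idx])
    return train_sum / k, valid_sum / k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([[1.0], [2.0], [3.0]]) + 0.1 * rng.normal(size=(100, 1))
train_rmse, valid_rmse = k_fold_lstsq(5, X, y)
print(train_rmse, valid_rmse)
```

Comparing the two averages across hyperparameter settings is how the validation error guides model selection: a training error far below the validation error suggests overfitting.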


#### Predict and submit results on Kaggle

def train_and_pred(train_features, test_features, train_labels, test_data,
                   num_epochs, lr, weight_decay, batch_size):
    # Get a fresh network
    net = get_net()
    # Train on the full training set; there is no validation data here
    train_ls, _ = train(net, train_features, train_labels, None, None,
                        num_epochs, lr, weight_decay, batch_size)
    # Plot the training error curve
    d2l.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    # Compute the predictions; preds has shape (1459, 1)
    preds = net(test_features).asnumpy()
    # Add a SalePrice column; preds.reshape(1, -1) has shape (1, 1459)
    test_data['SalePrice'] = pd.Series(preds.reshape(1, -1)[0])
    # Concatenate the test set Id and the predictions column-wise
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    # Save the submission in CSV format
    submission.to_csv('submission.csv', index=False)

# num_epochs, lr, weight_decay and batch_size must be defined before this call
train_and_pred(train_features, test_features, train_labels, test_data,
               num_epochs, lr, weight_decay, batch_size)
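The submission step can be checked in isolation. This sketch uses invented Ids and predictions and writes to an in-memory buffer instead of `submission.csv`.

```python
import io
import numpy as np
import pandas as pd

test_data = pd.DataFrame({"Id": [1461, 1462, 1463]})
# Shape (3, 1), like the output of the network on three test samples
preds = np.array([[120000.0], [150000.0], [180000.0]])

# Flatten the (n, 1) predictions to length n before storing them as a column
test_data["SalePrice"] = pd.Series(preds.reshape(1, -1)[0])
submission = pd.concat([test_data["Id"], test_data["SalePrice"]], axis=1)

buffer = io.StringIO()  # stands in for the submission file
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the row index out of the file, so the output has exactly the two columns Id and SalePrice.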


Posted by fazzfarrell on Tue, 19 Apr 2022 02:01:51 +0930