Machine learning practice: Training Models

Reference books:
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Second Edition;
Machine Learning, Zhou Zhihua (the "watermelon book")

Environment: Jupyter Notebook

The formulas from the watermelon book that appear in this article are not derived or explained here

4.1 linear regression

The final multiple linear regression equation (the normal equation) is:

θ = (X^T X)^(-1) X^T y

From here on, θ is used in place of the β that appears in the watermelon book's figures.

Generate a set of linear data to test the formula

import numpy as np
 
X = 2 * np.random.rand(100, 1)             # 100 points, uniform on [0, 2)
y = 4 + 3 * X + np.random.rand(100, 1)     # y = 3x + 4 plus uniform noise

The generated data is a set of points scattered around y = 3x + 4 (with 0 < x < 2)

np.random provides rand(), randn(), randint() and other sampling functions; see the NumPy documentation for their usage
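
A quick illustration (a sketch, not from the original article) of the three samplers just mentioned:

np.random.rand(2, 3)                 # uniform over [0, 1), shape (2, 3)
np.random.randn(2, 3)                # standard normal distribution, shape (2, 3)
np.random.randint(0, 10, size=5)     # random integers in [0, 10)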

We first compute θ with the normal equation.
np.linalg.inv() inverts a matrix, and dot() computes the matrix product

# np.c_ concatenates column-wise; np.ones((100, 1)) creates a 100 x 1 column of ones (the bias feature)
X_b = np.c_[np.ones((100, 1)),X]
# Normal equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

theta_best

Output:

array([[4.44082239],
       [3.02945868]])

For details on the functions used above (zeros(), ones(), empty(), np.c_), see the NumPy documentation.

Now use the computed θ to predict at the two endpoints x = 0 and x = 2

X_new = np.array([[0],[2]])
X_new_b = np.c_[np.ones((2,1)), X_new]
y_predict = X_new_b.dot(theta_best)
y_predict

Output:

array([[ 4.44082239],
       [10.49973976]])

Since this is linear regression, the prediction is a straight line in the plane, so the line can be drawn from the predictions at the two endpoints x = 0 and x = 2

Draw the prediction results of the model:

import matplotlib.pyplot as plt
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()


Above we computed the linear regression by hand in the simplest way; Scikit-Learn provides a LinearRegression class that does the same

# Scikit-Learn equivalent of the code above
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)

lin_reg.intercept_, lin_reg.coef_

Output:

(array([4.44082239]), array([[3.02945868]]))

From Scikit-Learn's documentation of coef_ and intercept_, we can infer that in the linear prediction formula f(x) = wx + b, intercept_ is b and coef_ is w

This also explains why we added the column of ones to X_b earlier: the normal-equation approach has no separate intercept_ and coef_, so the column of ones is multiplied by b (treated as θ[0]) and each feature column is multiplied by its corresponding weight, giving a single parameter vector (theta_best) that contains all of them
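
As a quick check (a sketch, not in the original article), the normal-equation result and Scikit-Learn's intercept_/coef_ describe the same parameter vector:

b = lin_reg.intercept_                                          # shape (1,)
w = lin_reg.coef_                                               # shape (1, 1)
np.allclose(theta_best, np.vstack([b.reshape(-1, 1), w.T]))     # True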

There is also a simpler way to find theta_best

theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd

Output:

array([[4.44082239],
       [3.02945868]])

This function computes X^+ y, where X^+ is the pseudoinverse of X_b. You can also compute the pseudoinverse directly with np.linalg.pinv()

np.linalg.pinv(X_b).dot(y)

Output:

array([[4.44082239],
       [3.02945868]])

So far we have four ways to compute θ: the normal equation, LinearRegression, np.linalg.lstsq() and np.linalg.pinv()

4.2 gradient descent

See books or materials for the theoretical part

4.2.1 batch gradient descent

Formula: θ(next step) = θ − η ∇θ MSE(θ)

Take a look at the quick implementation of this formula

#Batch gradient descent

eta = 0.1
n_iterations = 1000
m = 100

theta = np.random.randn(2,1)
# randn draws from the standard normal distribution (random initialization of theta)

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
    
theta

Output:

array([[4.44082239],
       [3.02945868]])

The result is exactly the same as the four previous methods. Part of the reason it works so well is that we chose a suitable learning rate η (eta)

If you want to find a suitable learning rate, you can use grid search; a brief sketch is given below
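
A minimal sketch (not from the original article) of tuning the learning rate with Scikit-Learn's GridSearchCV, applied here to SGDRegressor's eta0; the candidate values are only illustrative:

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"eta0": [0.001, 0.01, 0.1, 0.5]}    # illustrative candidate learning rates
grid_search = GridSearchCV(
    SGDRegressor(max_iter=1000, tol=1e-3, penalty=None),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(X, y.ravel())
grid_search.best_params_    # e.g. {'eta0': 0.1}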

4.2.2 stochastic gradient descent

The advantage of stochastic gradient descent is that it can jump out of local minima; the obvious drawback is that it never settles exactly at the minimum. It works by picking a single random sample at each step and using it to compute the next parameter update

We can get a good solution with only 50 epochs (compared with the 1000 iterations above)

# Stochastic gradient descent

n_epochs = 50
t0, t1 =5, 50

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2,1)

for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)
        # randint: random integer in [0, m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m +i)
        theta = theta -eta * gradients
        
theta

Output:

array([[4.426646  ],
       [3.04030804]])

Stochastic gradient descent can also be done simply with Scikit-Learn's SGDRegressor

# Scikit-Learn equivalent of the code above
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())

sgd_reg.intercept_, sgd_reg.coef_
(array([4.43004106]), array([3.05933965]))

4.3 polynomial regression

When the data is more complex than a straight line, we need to use polynomial regression

We generate a new set of data

import numpy as np
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.rand(m, 1)

This is a set of points scattered around y = 0.5x² + x + 2 (with −3 < x < 3). Let's take a look

import matplotlib.pyplot as plt
plt.plot(X, y, 'b.')
plt.show()


A straight line can never fit this data well. We use the PolynomialFeatures class to transform the training data, adding the square of each feature (degree=2) as a new feature

from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]

Output:

array([2.41278383])
X_poly[0]    # the original feature of X[0] and its square

Output:

array([2.41278383, 5.82152582])

X_poly[0] now contains the original feature of X and the square of the feature. Now fit the LinearRegression model to this training data

from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
lin_reg.intercept_, lin_reg.coef_    # intercept b and the weights

Output:

(array([2.50160549]), array([[1.00300224, 0.49453891]]))

Let's draw an image of this prediction

Note: 1. According to the Scikit-Learn documentation, the sample_weight parameter of LinearRegression's fit() is an optional array of per-sample weights (default=None), not the number of features; a uniform weight changes nothing, so fit(X_poly, y) above is called without it

2. plt.plot connects adjacent points in order, so we copy X into backup and sort it (axis=0 sorts a 2-D array column by column), transform the sorted values with PolynomialFeatures as before, and then predict on them

backup = X
backup = np.sort(backup,axis=0)
X_sorted_poly = poly_features.fit_transform(backup)
y_sorted_pred = lin_reg.predict(X_sorted_poly)
plt.plot(backup, y_sorted_pred, 'r-')
plt.plot(X, y, 'b.')
plt.show()

4.4 learning curve

A learning curve is another way, besides cross-validation, to tell whether a model is too simple or too complex.
It plots the model's performance on the training set and the validation set as a function of the training set size: the x-axis is the training set size and the y-axis is the RMSE (root mean square error)

import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors= [], []
    for m in range(1,len(X_train)):
        model.fit(X_train[:m],y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    plt.legend()
    plt.xlabel("Training set size")
    plt.ylabel("RMSE")
    return X_train, X_val, y_train, y_val

    
lin_reg = LinearRegression()
X_train, X_val, y_train, y_val = plot_learning_curves(lin_reg, X, y)


When there are only one or two training examples, the model fits them perfectly, which is why the training RMSE starts at 0. As more data is added, the model can no longer fit the training data perfectly, because the data is noisy and not linear, so the training error rises until it levels off.
The curves above are typical of an underfitting model: both curves reach a plateau, and they end up close together and fairly high.
Solution: use a more complex model (for example a higher polynomial degree).
Overfitting looks different: the training error is low, but there is a gap between the two curves.
Solution: use a larger training set.

from sklearn.pipeline import Pipeline

polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=4, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
X_train, X_val, y_train, y_val = plot_learning_curves(polynomial_regression, X, y)

4.5 regularized linear model

Regularization is all about preventing overfitting.
It works by constraining the weights of the model
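
As a rough reminder (a sketch of the idea, not a formula taken from this article; exact scaling conventions differ between the reference book and Scikit-Learn), each of the regularized models below adds a penalty term to the MSE cost:

J_ridge(θ) = MSE(θ) + α Σ θᵢ²
J_lasso(θ) = MSE(θ) + α Σ |θᵢ|
J_elastic net(θ) = MSE(θ) + r α Σ |θᵢ| + ((1 − r) / 2) α Σ θᵢ²

Here the sums run over the weights θ₁ … θₙ (the bias term θ₀ is not regularized), and r is the mixing ratio (l1_ratio in ElasticNet).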

4.5.1 ridge regression (Tikhonov regularization)


#Ridge regression
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
array([[5.5626674]])

The same can be done with stochastic gradient descent and an L2 penalty

# Stochastic gradient descent with an L2 penalty (ridge)
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2")
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])
array([5.55077116])

4.5.2 Lasso regression

#Lasso regression
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
# Roughly equivalent to SGDRegressor(penalty="l1")
lasso_reg.fit(X, y.ravel())
lasso_reg.predict([[1.5]])
array([5.52508037])

4.5.3 elastic net

# Elastic Net: a middle ground between ridge regression and Lasso; l1_ratio controls the mix
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
array([5.52104439])

4.5.4 early stopping

During training, the validation RMSE first decreases, but after a certain number of epochs it starts to rise again, which is a sign of overfitting. Early stopping therefore stops training as soon as the validation error reaches its minimum.

# Early stopping
from copy import deepcopy
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

#prepare the data
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler())
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                      penalty=None, learning_rate="constant", eta0=0.0005)
# warm_start=True: each call to fit() continues training from where it left off

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train.ravel())
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = deepcopy(sgd_reg)   # deepcopy keeps the trained weights (clone() would return an unfitted copy)
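
After the loop, best_epoch records the epoch with the lowest validation error and best_model holds a copy of the model from that epoch. A minimal usage sketch (not in the original article):

best_epoch                                       # epoch with the lowest validation error
y_val_predict = best_model.predict(X_val_poly_scaled)
mean_squared_error(y_val, y_val_predict)         # equals minimum_val_error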

4.6 logistic regression

4.6.1 decision boundary

We now use a new dataset: the iris dataset, with 150 flowers of three species

from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())

Output:

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename']

We use only the petal width feature to build a classifier that detects whether a flower belongs to class 2 (Iris virginica)

import numpy as np
X = iris["data"][:, 3:]                  # petal width only
y = (iris["target"] == 2).astype(int)    # 1 if Iris virginica, else 0

Training a logistic regression model

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression()
log_reg.fit(X, y)

For petal widths from 0 to 3 cm, let's see what probabilities the model estimates

X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
import matplotlib.pyplot as plt
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris virginica")
plt.legend()
plt.show()

Use the classifier to predict, for petal widths of 1.7 cm and 1.5 cm, whether the flower is class 2

log_reg.predict([[1.7], [1.5]])
array([1, 0])

Yes and no, respectively
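
Since this subsection is about the decision boundary, here is a small sketch (not part of the original article) that finds the petal width at which the estimated probability crosses 0.5, using the X_new and y_proba arrays computed above:

decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]
decision_boundary    # around 1.6 cm: wider petals get classified as Iris virginica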

4.6.2 Softmax regression

We use Softmax regression to classify the flowers into three classes. By default, LogisticRegression uses a one-versus-the-rest strategy when trained on more than two classes; setting multi_class="multinomial" switches it to Softmax regression

X = iris["data"][:,(2,3)]
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial",solver="lbfgs",C=10)
softmax_reg.fit(X, y)

Predict the species of a flower whose petals are 5 cm long and 2 cm wide

softmax_reg.predict([[5,2]])
array([2])

Look at the predicted probability for each class

softmax_reg.predict_proba([[5,2]])
array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])
