Reference books:
Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, Second Edition;
Machine Learning, by Zhou Zhihua (the "watermelon book")
Environment: Jupyter Notebook
The formulas from the watermelon book that appear in this article are not derived or explained here.
4.1 linear regression
The final multivariate linear regression equation (the normal equation) is:
θ = (X^T X)^(-1) X^T y
From here on, θ is used in place of the β that appears in the book's figure.
Generate a set of linear data to test the formula
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.rand(100, 1)
The generated data is a scatter of points around y = 3x + 4, with 0 < x < 2 and uniform noise added.
np.random offers rand(), randn(), randint() and other generators; see the NumPy documentation for their details.
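As a brief reminder (my own addition, not from the original article), these generators behave as follows:

import numpy as np
np.random.rand(2, 3)          # uniform samples in [0, 1), shape (2, 3)
np.random.randn(2, 3)         # samples from the standard normal distribution
np.random.randint(0, 10, 5)   # 5 random integers in [0, 10)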
First, we compute θ directly from the normal equation.
The inv() function in np.linalg inverts a matrix, and dot() computes the matrix product.
# np.c_ concatenates along the second axis (adds a column); np.ones creates a 100 x 1 array of ones
X_b = np.c_[np.ones((100, 1)), X]
# Normal equation: theta = (X^T X)^(-1) X^T y
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
theta_best
Output:
array([[4.44082239], [3.02945868]])
The functions used above are described in more detail in these notes:
Python numpy functions: zeros(), ones(), empty()
Usage of np.c_ in numpy
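A small illustration (my own addition) of what those helpers do:

import numpy as np
a = np.ones((3, 1))     # 3 x 1 array of ones
b = np.zeros((3, 1))    # 3 x 1 array of zeros
np.c_[a, b]             # concatenates along columns -> a 3 x 2 array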
Now use the computed θ to make predictions at the two endpoints.
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new]
y_predict = X_new_b.dot(theta_best)
y_predict
Output:
array([[ 4.44082239], [10.49973976]])
Since this is linear regression, the prediction is necessarily a straight line in the x-y plane, so the line can be drawn from the predictions at the two endpoints x = 0 and x = 2.
Draw the prediction results of the model:
import matplotlib.pyplot as plt
plt.plot(X_new, y_predict, "r-")
plt.plot(X, y, "b.")
plt.axis([0, 2, 0, 15])
plt.show()
Above, we computed the linear regression in the most basic way; Scikit-Learn also provides a ready-made class for it.
# Scikit-Learn version of the above
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X, y)
lin_reg.intercept_, lin_reg.coef_
Output:
(array([4.44082239]), array([[3.02945868]]))
From the meaning of coef_ and intercept_ in sklearn, we can infer that in the linear prediction formula f(x) = wx + b, intercept_ is b and coef_ is w.
Now the purpose of the column of ones added to X_b earlier also becomes clear: the normal-equation approach has no separate intercept_ and coef_, so the column of ones is multiplied by b (b is treated as θ[0]) and each x column is multiplied by its corresponding w, giving all the parameters in a single vector (theta_best).
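A quick check of this correspondence (my own addition, reusing the objects computed above): theta_best packs b and w into one column vector, while sklearn exposes them separately.

print(theta_best.ravel())                   # [b, w] from the normal equation
print(lin_reg.intercept_, lin_reg.coef_)    # b and w from sklearn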
Alternatively, there is a simpler way to find theta_best:
theta_best_svd, residuals, rank, s = np.linalg.lstsq(X_b, y, rcond=1e-6)
theta_best_svd
Output:
array([[4.44082239], [3.02945868]])
This function computes X⁺y, where X⁺ is the pseudoinverse of X_b; you can also compute the pseudoinverse directly with np.linalg.pinv():
np.linalg.pinv(X_b).dot(y)
Output:
array([[4.44082239], [3.02945868]])
We now have four ways to compute the regression parameters: the normal equation, LinearRegression, np.linalg.lstsq(), and np.linalg.pinv().
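As a quick consistency check (my own addition, reusing the objects computed above), all four estimates should agree up to numerical precision:

theta_sklearn = np.r_[lin_reg.intercept_.reshape(1, 1), lin_reg.coef_.T]   # stack b on top of w
(np.allclose(theta_best, theta_sklearn),
 np.allclose(theta_best, theta_best_svd),
 np.allclose(theta_best, np.linalg.pinv(X_b).dot(y)))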
4.2 gradient descent
See the reference books or other materials for the theory.
4.2.1 batch gradient descent
Formula: θ(next step) = θ - η ∇θ MSE(θ), where η is the learning rate and ∇θ MSE(θ) = (2/m) X^T (Xθ - y).
Take a look at the quick implementation of this formula
# Batch gradient descent
eta = 0.1            # learning rate
n_iterations = 1000
m = 100
theta = np.random.randn(2, 1)   # randn draws from a standard normal distribution
for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)
    theta = theta - eta * gradients
theta
Output:
array([[4.44082239], [3.02945868]])
The result is exactly the same as with the previous four methods. That is partly because this dataset is very well behaved, and partly because we chose a suitable eta (η, the learning rate).
If you want to find a suitable learning rate, you can use grid search; it is not covered in detail here, but a rough sketch follows.
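Here is what such a grid search could look like (my own example; the eta0 values in the grid are arbitrary, and GridSearchCV with 3-fold cross-validation and negative MSE scoring is just one reasonable choice):

from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"eta0": [0.001, 0.01, 0.1, 0.5]}
grid_search = GridSearchCV(SGDRegressor(max_iter=1000, tol=1e-3, penalty=None),
                           param_grid, cv=3, scoring="neg_mean_squared_error")
grid_search.fit(X, y.ravel())
grid_search.best_params_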
4.2.2 stochastic gradient descent
The advantage of stochastic gradient descent is that it can jump out of local minima; the obvious drawback is that it never settles exactly at the minimum. Its principle is to pick one random sample at each step and use it to adjust the parameters.
We can get a good solution with only 50 epochs (compared with the 1000 iterations above).
# Stochastic gradient descent
n_epochs = 50
t0, t1 = 5, 50   # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    for i in range(m):
        random_index = np.random.randint(m)   # random integer in [0, m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(epoch * m + i)
        theta = theta - eta * gradients
theta
Output:
array([[4.426646 ], [3.04030804]])
Stochastic gradient descent can also be done directly with Scikit-Learn:
# Scikit-Learn version of the above
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())
sgd_reg.intercept_, sgd_reg.coef_
(array([4.43004106]), array([3.05933965]))
4.3 polynomial regression
When the data is more complex than a straight line, we need to use polynomial regression
We generate a new set of data
import numpy as np
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.rand(m, 1)
This is a scatter of points around y = 0.5x^2 + x + 2, with -3 < x < 3 and uniform noise added. Let's take a look:
import matplotlib.pyplot as plt
plt.plot(X, y, 'b.')
plt.show()
Clearly, a straight line can never fit this data properly. We use the PolynomialFeatures class to transform the training data, adding the square of each feature (degree=2) as a new feature.
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
X[0]
Output:
array([2.41278383])
X_poly[0]   # the original feature of X[0] plus its square
Output:
array([2.41278383, 5.82152582])
X_poly[0] now contains the original feature of X[0] together with its square. Now fit a LinearRegression model to this training data:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y, sample_weight=2)   # see note 1 below: a constant sample_weight does not change the fit
lin_reg.intercept_, lin_reg.coef_         # b and the coefficients (theta)
Output:
(array([2.50160549]), array([[1.00300224, 0.49453891]]))
Let's draw an image of this prediction
Note: 1. I originally wrote this step separately and kept fiddling with the prediction. Later, on the sklearn website, I found that LinearRegression's fit() has a sample_weight parameter (default None). It actually assigns a weight to each individual sample rather than indicating the number of features, and a constant weight does not change the fitted result, so the parameter added to fit() above is not really needed (a small check follows the plotting code below).
2. plt.plot connects adjacent points in order, so I copy X into backup and sort it (axis=0 sorts along the columns of a 2-D array), then apply the same polynomial transform to it as before, so the curve can be predicted and drawn in the same way as above.
backup = X
backup = np.sort(backup, axis=0)
X_sorted_poly = poly_features.fit_transform(backup)
y_sorted_pred = lin_reg.predict(X_sorted_poly)
plt.plot(backup, y_sorted_pred, 'r-')
plt.plot(X, y, 'b.')
plt.show()
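A quick check of the sample_weight remark in note 1 (my own addition; check_reg is just an illustrative name): fitting the same data without sample_weight should give the same intercept and coefficients as above, since a constant weight applies equally to every sample.

check_reg = LinearRegression()
check_reg.fit(X_poly, y)                  # no sample_weight this time
check_reg.intercept_, check_reg.coef_     # should match lin_reg's values above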
4.4 learning curve
Learning curves: besides cross-validation, another way to tell whether a model is too simple or too complex.
They plot the model's performance on the training set and the validation set as a function of the training set size: the x-axis is the training set size and the y-axis is the RMSE (root mean square error).
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_predict = model.predict(X_train[:m])
        y_val_predict = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_predict))
        val_errors.append(mean_squared_error(y_val, y_val_predict))
    plt.plot(np.sqrt(train_errors), "r-+", linewidth=2, label="train")
    plt.plot(np.sqrt(val_errors), "b-", linewidth=3, label="val")
    return X_train, X_val, y_train, y_val

lin_reg = LinearRegression()
X_train, X_val, y_train, y_val = plot_learning_curves(lin_reg, X, y)
We can see that when there are only one or two examples at the start, the model fits them perfectly, so the RMSE is 0. As more data is added the model can no longer fit the training data perfectly, because the data is noisy and not linear, so the training error rises until it reaches a plateau.
The curve above shows an underfitting model: both curves reach a plateau, and they are close together and fairly high.
Solution: use a more complex model (for example, a higher-degree polynomial).
Overfitting: the training error is low, but there is a gap between the two curves.
Solution: use a larger training set
from sklearn.pipeline import Pipeline
polynomial_regression = Pipeline([
    ("poly_features", PolynomialFeatures(degree=4, include_bias=False)),
    ("lin_reg", LinearRegression()),
])
X_train, X_val, y_train, y_val = plot_learning_curves(polynomial_regression, X, y)
4.5 regularized linear model
All of this is aimed at preventing overfitting.
Regularization works by constraining the model's weights.
4.5.1 ridge regression (Tikhonov regularization)
Ridge regression
# Ridge regression
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
ridge_reg.predict([[1.5]])
array([[5.5626674]])
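As a side note (my own sketch, not from the original article): ridge regression also has a closed-form solution, θ = (X^T X + αA)^(-1) X^T y, where A is the identity matrix with A[0, 0] set to 0 so that the bias term is not regularized. The names X_b_ridge and theta_ridge below are just illustrative; the result should be very close to what sklearn's Ridge produced above.

X_b_ridge = np.c_[np.ones((len(X), 1)), X]   # add the bias column for the current data
A = np.identity(X_b_ridge.shape[1])
A[0, 0] = 0                                   # do not penalize the bias term
theta_ridge = np.linalg.inv(X_b_ridge.T.dot(X_b_ridge) + 1 * A).dot(X_b_ridge.T).dot(y)
np.c_[np.ones((1, 1)), [[1.5]]].dot(theta_ridge)   # compare with ridge_reg.predict([[1.5]])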
The same can also be done with stochastic gradient descent:
# Stochastic gradient descent with an l2 penalty
from sklearn.linear_model import SGDRegressor
sgd_reg = SGDRegressor(penalty="l2")
sgd_reg.fit(X, y.ravel())
sgd_reg.predict([[1.5]])
array([5.55077116])
4.5.2 Lasso regression
# Lasso regression
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)   # equivalent to lasso_reg = SGDRegressor(penalty="l1")
lasso_reg.fit(X, y.ravel())
lasso_reg.predict([[1.5]])
array([5.52508037])
4.5.3 elastic net
# Elastic net: combines Lasso and ridge regression (a middle ground between the two)
from sklearn.linear_model import ElasticNet
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
elastic_net.predict([[1.5]])
array([5.52104439])
4.5.4 early stopping
During training the validation RMSE falls at first, but after a certain number of epochs it starts to rise again, which is a sign of overfitting. So we stop training when the validation error reaches its minimum.
# Early stopping
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler

# prepare the data
poly_scaler = Pipeline([
    ("poly_features", PolynomialFeatures(degree=90, include_bias=False)),
    ("std_scaler", StandardScaler())
])
X_train_poly_scaled = poly_scaler.fit_transform(X_train)
X_val_poly_scaled = poly_scaler.transform(X_val)

sgd_reg = SGDRegressor(max_iter=1, tol=-np.infty, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_epoch = None
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train_poly_scaled, y_train.ravel())   # warm_start=True continues from the previous epoch
    y_val_predict = sgd_reg.predict(X_val_poly_scaled)
    val_error = mean_squared_error(y_val, y_val_predict)
    if val_error < minimum_val_error:
        minimum_val_error = val_error
        best_epoch = epoch
        best_model = clone(sgd_reg)
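After the loop it can be useful to see when the lowest validation error occurred; a quick look (my own addition, using the variables defined above):

best_epoch, np.sqrt(minimum_val_error)   # epoch with the lowest validation error, and its RMSE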
4.6 logistic regression
4.6.1 decision boundary
Now we use a new dataset: the iris dataset, 150 flowers from three species.
from sklearn import datasets
iris = datasets.load_iris()
list(iris.keys())
Output:
['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename']
We use only the petal width feature to build a classifier that checks whether a flower is class "2" (Iris virginica).
import numpy as np
X = iris["data"][:, 3:]   # petal width
y = (iris["target"] == 2).astype(np.int)
Training a logistic regression model
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg.fit(X, y)
Let's look at the model's estimated probabilities for petal widths from 0 to 3 cm:
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)
import matplotlib.pyplot as plt
plt.plot(X_new, y_proba[:, 1], "g-", label="Iris virginica")
plt.plot(X_new, y_proba[:, 0], "b--", label="Not Iris virginica")
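Since this subsection is about the decision boundary, we can also locate it explicitly. This is my own small addition (the variable name decision_boundary is just illustrative): it finds the smallest petal width for which the estimated probability of Iris virginica reaches 50%.

decision_boundary = X_new[y_proba[:, 1] >= 0.5][0]
decision_boundary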
Use the classifier to predict whether flowers with petal widths of 1.7 cm and 1.5 cm are class "2":
log_reg.predict([[1.7], [1.5]])
array([1, 0])
Yes and no, respectively
4.6.2 Softmax regression
We use Softmax regression to classify the flowers into three classes. When trained on more than two classes, LogisticRegression defaults to one-versus-the-rest; setting the hyperparameter multi_class="multinomial" switches it to Softmax regression.
X = iris["data"][:, (2, 3)]   # petal length, petal width
y = iris["target"]
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)
Predict the species whose petals are 5cm long and 2cm wide
softmax_reg.predict([[5,2]])
array([2])
Look at the predicted probability for each class:
softmax_reg.predict_proba([[5,2]])
array([[6.38014896e-07, 5.74929995e-02, 9.42506362e-01]])
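As a small sanity check (my own addition, not from the books): the predicted probabilities for one flower sum to 1, and the class with the highest probability matches the output of predict().

proba = softmax_reg.predict_proba([[5, 2]])
proba.sum(), proba.argmax()   # sums to 1 (up to floating point); argmax is class 2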