# Overview of k-nearest neighbor algorithm

Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data

Disadvantages: high computational complexity and high space complexity

Applicable data types: numerical and nominal

Working principle: we have a sample data set, also known as the training set, in which every sample carries a label, so we know the relationship between each sample and the category it belongs to. When new, unlabeled data is input, we compare each feature of the new data with the features of the samples in the training set, and extract the classification label of the sample whose features are most similar (the nearest neighbor). In practice we take the k most similar samples and let their labels vote, which is where the k in k-nearest neighbor comes from; the winning label becomes the predicted classification of the new data.
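To get a concrete feel for "most similar", here is the Euclidean distance between the unknown film [18, 90] and one training sample [3, 104] from the data set used below; a quick sketch only, the full implementation comes later:

```python
import math

# Feature order: (fighting scenes, kissing scenes), taken from the
# film data used later in this article
unknown = (18, 90)
sample = (3, 104)

# Euclidean distance: sqrt((18-3)^2 + (90-104)^2) = sqrt(225 + 196) = sqrt(421)
dist = math.sqrt((unknown[0] - sample[0]) ** 2 + (unknown[1] - sample[1]) ** 2)
print(round(dist, 2))  # → 20.52
```

The smaller this number, the more similar the two films are under these two features.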

# Simple code implementation

We implement a simple film-classification example. Each film has two features: the number of kissing scenes and the number of fighting scenes. We use the k-nearest neighbor algorithm to classify films as affectional (romance) films or action movies.

## 1. Data preparation

We set the following data:

| Fighting scenes | Kissing scenes | Film type |
| --- | --- | --- |
| 3 | 104 | affectional film |
| 2 | 100 | affectional film |
| 1 | 81 | affectional film |
| 101 | 10 | action movie |
| 99 | 5 | action movie |
| 98 | 2 | action movie |
| 18 | 90 | unknown |

The data in the last row is the data we need to classify.

We use direct data creation here. Simple and rough!

```python
import numpy as np
import operator

def createData():
    data = np.array([[3, 104],
                     [2, 100],
                     [1, 81],
                     [101, 10],
                     [99, 5],
                     [98, 2]])
    labels = ['affectional film', 'affectional film', 'affectional film',
              'action movie', 'action movie', 'action movie']
    return data, labels
```

## 2. Algorithm implementation

We need to implement the following steps:

1. Calculate the distance from the point of known category data to the new data point
2. Arrange in order of increasing distance
3. Select the k points with the smallest distance from the current point
4. Determine the probability of occurrence of the type of k points
5. Return the category with the highest probability of k points as the prediction classification of the current point
```"""
inX:New data point
dataSet:Training set matrix
labels:Tag set corresponding to training set
"""
def classify(inX, dataSet, labels, k):
# Returns the size of the data set, and shap() returns the number of rows and columns. Here, we only need the number of rows
dataSetsize = dataSet.shape
# Using numpy's tile() method, we construct a matrix with each row of new data and as many rows as the training set, and let them subtract the training set matrix
diffMat = np.tile(inX, (dataSetsize, 1)) - dataSet
# Square each element of the diffMat matrix
sqDiffMat = diffMat ** 2
# Add each row of the diffMat matrix to get a 1 * 6 matrix. When the parameter axis=1 in the sum() function means to add each row, and when it is 0, it means to add columns
sqDistance = sqDiffMat.sum(axis=1)
# Open sqDistance with another root sign to get the distance. At this time, distance stores the distance from the new data point to each data point
distance = sqDistance ** 0.5
# Sort the obtained distances according to the size, and return the order of their serial numbers to reflect the size. For example, the return of [4,8,1,6] is [1,0,3,2]
sortedDistIndicies = distance.argsort()
# Create an empty dictionary to store the type and number of movies in k data
classCount = {}
for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
# Use the get() function to correspond the movie type in the k data with the number of movies
# get(voteIlabel, 0) indicates the value whose key is' voteilabel '. If there is no value, it is set to 0. If there is, it returns the value
classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
# The sort() method sorts by the size of the classCount value
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
# Returns the key of the first key value pair
return sortedClassCount```
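To make the two NumPy helpers above less mysterious, here is a small standalone check of `np.tile()` and `argsort()` on the film data from section 1:

```python
import numpy as np

inX = [18, 90]
dataSet = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])

# tile() repeats the new point once per training row, so the subtraction
# lines up element-wise with the training matrix
diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
print(diffMat.shape)  # → (6, 2)

# argsort() gives the indices of the training rows from nearest to farthest:
# the three affectional films (rows 1, 2, 0) come before the action movies
distance = ((diffMat ** 2).sum(axis=1)) ** 0.5
print(distance.argsort())  # → [1 2 0 3 4 5]
```

So for k=3 the nearest neighbors are all affectional films, which is exactly the vote that `classify()` counts.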

## 3. Program result display

```python
if __name__ == '__main__':
    newdata = [18, 90]
    data, group = createData()
    result = classify(newdata, data, group, 3)
    print("It is predicted that the type of this unknown film is: " + result)
```

For the film [18, 90], we predict that it is an affectional (romance) film.

## 4. Summary

The data here has only two features, but in reality we often encounter high-dimensional data with hundreds of features. Euclidean distance extends naturally to those cases; the remaining steps are the same, and classification is still carried out by distance.
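The same vectorized distance computation carries over unchanged to high-dimensional data. A minimal sketch with made-up random 5-feature data; here broadcasting replaces `np.tile()`, since NumPy applies it automatically when the shapes are compatible:

```python
import numpy as np

# Hypothetical 5-feature data: 100 training samples and one new point
rng = np.random.default_rng(0)
dataSet = rng.random((100, 5))
inX = rng.random(5)

# Squared differences summed per row, then a square root:
# one Euclidean distance per training sample, in any dimension
distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
print(distances.shape)          # → (100,)
print(int(distances.argmin()))  # index of the nearest training sample
```

Broadcasting `dataSet - inX` produces exactly the same difference matrix as the `np.tile()` construction in `classify()`, just without materializing the repeated rows by hand.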

Based on this, we can speculate that the more informative the features and the larger the training set, the more accurate the prediction tends to be. Still, the result is only a prediction, and it will not always be correct.

Tags: Machine Learning

Posted by mkarabulut on Sun, 17 Apr 2022 02:51:10 +0930