Simple principle of kNN classifier

Overview of k-nearest neighbor algorithm

Advantages: high precision, insensitive to outliers and no data input assumption

Disadvantages: high computational complexity and space complexity

Applicable data range: numerical type and nominal type

Working principle: there is a sample data set, also known as training sample set, and each sample set has labels. We know the corresponding relationship between each data and its classification. After inputting the data without labels, compare each feature of the new data with the feature used in the data pair in the sample book, and then extract the classification label of the data with the most similar feature (nearest neighbor) in the sample set through the algorithm, As the classification label of this experience data.

Simple code implementation

We simply implement the example of film classification. We know that the film has two characteristics, one is the number of kissing scenes, and the other is the number of fighting scenes. We need to use k-nearest neighbor algorithm to divide films into love films and action films.

1, Data preparation

We set the following data:

Fighting lensKissing lensFilm type
3104affectional film
2100affectional film
181affectional film
10110action movie
995action movie
982action movie

The data in the last row is the data we need to classify.

We use direct data creation here. Simple and rough!

import numpy as np
import operator

def createData():
    data = array([[3, 104],
                  [2, 100]
                  [1, 81]
                  [101, 10]
                  [99, 5]
                  [98, 2]])
    labels = {'affectional film', 'affectional film', 'affectional film', 'action movie', 'action movie', 'action movie'}
    return data, labels

2, Algorithm implementation

We need to implement the following steps:

  1. Calculate the distance from the point of known category data to the new data point
  2. Arrange in order of increasing distance
  3. Select the k points with the smallest distance from the current point
  4. Determine the probability of occurrence of the type of k points
  5. Return the category with the highest probability of k points as the prediction classification of the current point
inX:New data point
dataSet:Training set matrix
labels:Tag set corresponding to training set
def classify(inX, dataSet, labels, k):
    # Returns the size of the data set, and shap() returns the number of rows and columns. Here, we only need the number of rows
    dataSetsize = dataSet.shape[0]
    # Using numpy's tile() method, we construct a matrix with each row of new data and as many rows as the training set, and let them subtract the training set matrix
    diffMat = np.tile(inX, (dataSetsize, 1)) - dataSet
    # Square each element of the diffMat matrix
    sqDiffMat = diffMat ** 2
    # Add each row of the diffMat matrix to get a 1 * 6 matrix. When the parameter axis=1 in the sum() function means to add each row, and when it is 0, it means to add columns
    sqDistance = sqDiffMat.sum(axis=1)
    # Open sqDistance with another root sign to get the distance. At this time, distance stores the distance from the new data point to each data point
    distance = sqDistance ** 0.5
    # Sort the obtained distances according to the size, and return the order of their serial numbers to reflect the size. For example, the return of [4,8,1,6] is [1,0,3,2]
    sortedDistIndicies = distance.argsort()
    # Create an empty dictionary to store the type and number of movies in k data
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        # Use the get() function to correspond the movie type in the k data with the number of movies
        # get(voteIlabel, 0) indicates the value whose key is' voteilabel '. If there is no value, it is set to 0. If there is, it returns the value
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # The sort() method sorts by the size of the classCount value
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    # Returns the key of the first key value pair
    return sortedClassCount[0][0]

3, Program result display

if __name__ == '__main__':
    newdata = [18, 90]
    data, group = createData()
    result = classify(newdata, data, group, 3)
    print("It is predicted that the type of this unknown film is:" + result)

For [18,90] films, we predict that they are romantic films.

4, Summary

The data here has only two features, but in reality, we encounter more high-dimensional data, often with hundreds of features. We can use Euclidean distance to calculate the distance.

Similar to other steps, classification is carried out by distance.

Based on this, we can speculate that the more the number of features of the data set, the more accurate the prediction result. The more the number of data sets, the more accurate the prediction result. The result is only prediction, not necessarily all right.

Tags: Machine Learning

Posted by mkarabulut on Sun, 17 Apr 2022 02:51:10 +0930