# Overview of k-nearest neighbor algorithm

Advantages: high accuracy, insensitive to outliers, and no assumptions about the input data

Disadvantages: high computational complexity and high space complexity

Applicable data types: numerical and nominal

Working principle: we have a sample data set, also known as the training set, in which every sample carries a label, so we know the relationship between each sample and the category it belongs to. When new, unlabeled data is input, we compare each feature of the new data with the features of the samples in the training set, and extract the classification label of the sample whose features are most similar (the nearest neighbor). In practice we take the k most similar samples and let their labels vote, which is where the k in k-nearest neighbor comes from; the winning label becomes the predicted classification of the new data.
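To get a concrete feel for "most similar", here is the Euclidean distance between the unknown film [18, 90] and one training sample [3, 104] from the data set used below; a quick sketch only, the full implementation comes later:

```python
import math

# Feature order: (fighting scenes, kissing scenes), taken from the
# film data used later in this article
unknown = (18, 90)
sample = (3, 104)

# Euclidean distance: sqrt((18-3)^2 + (90-104)^2) = sqrt(225 + 196) = sqrt(421)
dist = math.sqrt((unknown[0] - sample[0]) ** 2 + (unknown[1] - sample[1]) ** 2)
print(round(dist, 2))  # → 20.52
```

The smaller this number, the more similar the two films are under these two features.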

# Simple code implementation

We implement a simple film-classification example. Each film has two features: the number of kissing scenes and the number of fighting scenes. We use the k-nearest neighbor algorithm to classify films as affectional (romance) films or action movies.

## 1. Data preparation

We set the following data:

| Fighting scenes | Kissing scenes | Film type |
| --- | --- | --- |
| 3 | 104 | affectional film |
| 2 | 100 | affectional film |
| 1 | 81 | affectional film |
| 101 | 10 | action movie |
| 99 | 5 | action movie |
| 98 | 2 | action movie |
| 18 | 90 | unknown |

The data in the last row is the data we need to classify.

We use direct data creation here. Simple and rough!

```python
import numpy as np
import operator

def createData():
    data = np.array([[3, 104],
                     [2, 100],
                     [1, 81],
                     [101, 10],
                     [99, 5],
                     [98, 2]])
    labels = ['affectional film', 'affectional film', 'affectional film',
              'action movie', 'action movie', 'action movie']
    return data, labels
```

## 2. Algorithm implementation

We need to implement the following steps:

1. Calculate the distance from the point of known category data to the new data point
2. Arrange in order of increasing distance
3. Select the k points with the smallest distance from the current point
4. Determine the probability of occurrence of the type of k points
5. Return the category with the highest probability of k points as the prediction classification of the current point
```"""
inX:New data point
dataSet:Training set matrix
labels:Tag set corresponding to training set
"""
def classify(inX, dataSet, labels, k):
# Returns the size of the data set, and shap() returns the number of rows and columns. Here, we only need the number of rows
dataSetsize = dataSet.shape
# Using numpy's tile() method, we construct a matrix with each row of new data and as many rows as the training set, and let them subtract the training set matrix
diffMat = np.tile(inX, (dataSetsize, 1)) - dataSet
# Square each element of the diffMat matrix
sqDiffMat = diffMat ** 2
# Add each row of the diffMat matrix to get a 1 * 6 matrix. When the parameter axis=1 in the sum() function means to add each row, and when it is 0, it means to add columns
sqDistance = sqDiffMat.sum(axis=1)
# Open sqDistance with another root sign to get the distance. At this time, distance stores the distance from the new data point to each data point
distance = sqDistance ** 0.5
# Sort the obtained distances according to the size, and return the order of their serial numbers to reflect the size. For example, the return of [4,8,1,6] is [1,0,3,2]
sortedDistIndicies = distance.argsort()
# Create an empty dictionary to store the type and number of movies in k data
classCount = {}
for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
# Use the get() function to correspond the movie type in the k data with the number of movies
# get(voteIlabel, 0) indicates the value whose key is' voteilabel '. If there is no value, it is set to 0. If there is, it returns the value
classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
# The sort() method sorts by the size of the classCount value
sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
# Returns the key of the first key value pair
return sortedClassCount```
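To make the two NumPy helpers above less mysterious, here is a small standalone check of `np.tile()` and `argsort()` on the film data from section 1:

```python
import numpy as np

inX = [18, 90]
dataSet = np.array([[3, 104], [2, 100], [1, 81], [101, 10], [99, 5], [98, 2]])

# tile() repeats the new point once per training row, so the subtraction
# lines up element-wise with the training matrix
diffMat = np.tile(inX, (dataSet.shape[0], 1)) - dataSet
print(diffMat.shape)  # → (6, 2)

# argsort() gives the indices of the training rows from nearest to farthest:
# the three affectional films (rows 1, 2, 0) come before the action movies
distance = ((diffMat ** 2).sum(axis=1)) ** 0.5
print(distance.argsort())  # → [1 2 0 3 4 5]
```

So for k=3 the nearest neighbors are all affectional films, which is exactly the vote that `classify()` counts.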

## 3. Program result display

```python
if __name__ == '__main__':
    newdata = [18, 90]
    data, group = createData()
    result = classify(newdata, data, group, 3)
    print("It is predicted that the type of this unknown film is: " + result)
```

For the film [18, 90], we predict that it is an affectional (romance) film.

## 4. Summary

The data here has only two features, but in reality we often encounter high-dimensional data with hundreds of features. Euclidean distance extends naturally to those cases; the remaining steps are the same, and classification is still carried out by distance.
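The same vectorized distance computation carries over unchanged to high-dimensional data. A minimal sketch with made-up random 5-feature data; here broadcasting replaces `np.tile()`, since NumPy applies it automatically when the shapes are compatible:

```python
import numpy as np

# Hypothetical 5-feature data: 100 training samples and one new point
rng = np.random.default_rng(0)
dataSet = rng.random((100, 5))
inX = rng.random(5)

# Squared differences summed per row, then a square root:
# one Euclidean distance per training sample, in any dimension
distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
print(distances.shape)          # → (100,)
print(int(distances.argmin()))  # index of the nearest training sample
```

Broadcasting `dataSet - inX` produces exactly the same difference matrix as the `np.tile()` construction in `classify()`, just without materializing the repeated rows by hand.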

Based on this, we can speculate that the more informative the features and the larger the training set, the more accurate the prediction tends to be. Still, the result is only a prediction, and it will not always be correct.

Tags: Machine Learning

Posted by mkarabulut on Sun, 17 Apr 2022 02:51:10 +0930