Machine learning K-means algorithm (2D)

The Chinese name of K-means algorithm is K-means clustering algorithm.


Cluster ing: for a large number of data sets with unknown labels, divide the data set into multiple clusters according to the internal similarity of data, so as to make the data similarity within the family as large as possible and the data similarity between categories as small as possible.

Why clustering: customer segmentation is a method of discovering user characteristics. v grouping a data-based customer information; So as to give you an overview of customer information, which can be directly transformed into marketing strategies for different customers.

Application fields:

Economic field:

  • Help market analysts find different customer groups from the customer database
  • Cluster the residential areas to determine the location of ATM
  • Analyze the stock market sector and find out the most dynamic sector leading stocks
  • Enterprise credit rating classification

Biological field:

  • Derivation of classification of plants and animals
  • Classify the genes and get an understanding of the population


  • As a preprocessing step of other mathematical algorithms, the data distribution is obtained


Characteristics of "class" in cluster analysis:

  • The class of clustering is not given in advance, but divided according to the similarity and distance of data;
  • The number and structure of clusters are not assumed in advance

The purpose of clustering method is to find:

  • Potential natural grouping structure
  • Relationship of interest

Overview of K-means algorithm

What is K?
K is the number of classes in the clustering algorithm.

What is means?
means is the mean algorithm.

Summary: K-means is a hard clustering algorithm that uses the mean algorithm to divide the data into k categories!

about even Continued type genus nature \red {continuous attribute} Continuous attributes have good clustering effect and are not suitable for dealing with discrete attributes.

K-means evaluation criteria

A good clustering method should be able to produce high-quality clustering results - clusters. These clusters should have the following two characteristics:

  • High intra cluster similarity
  • Low inter cluster similarity

base book thinking Think \red {basic thought} Basic idea: it belongs to the partitioning method, which divides the data set into different categories (or clusters) through iteration, so as to optimize the criterion function for evaluating the clustering performance, and make each cluster compact within the cluster and independent between the clusters.

Basic flow of algorithm

  1. k points are randomly selected as the center of the initial cluster, and each center represents each cluster
  2. Calculate the distance from all points to these k centers, and classify the points into the nearest cluster
  3. Adjust the cluster center, that is, move the cluster center to the geometric center of the cluster (i.e. average value)
  4. Repeat steps 2 and 3 until the center of the cluster no longer moves, and the algorithm converges

Main factors of K-means algorithm

Advantages and disadvantages of K-means


  1. The idea is simple and easy
  2. The time complexity is nearly linear
  3. For large data sets, it has high efficiency and scalability


  1. Selection dependent on initial mean
  2. The clustering number k value must be given in advance
  3. Sensitive to noise and isolated data

python implements the two-dimensional K-means algorithm and draws pictures with matplotlib

The data here is random data. Just change the input method.

import numpy as np
import matplotlib.pyplot as plt

def distance(e1, e2):
    return np.sqrt((e1[0] - e2[0]) ** 2 + (e1[1] - e2[1]) ** 2)

def means(arr):
    return np.array([np.mean([e[0] for e in arr]), np.mean([e[1] for e in arr])])

def farthest(k_arr, arr):
    f = [0, 0]
    max_d = 0
    for e in arr:
        d = 0
        for i in range(k_arr.__len__()):
            d = d + np.sqrt(distance(k_arr[i], e))
        if d > max_d:
            max_d = d
            f = e
    return f

def closest(a, arr):
    c = arr[1]
    min_d = distance(a, arr[1])
    arr = arr[1:]
    for e in arr:
        d = distance(a, e)
        if d < min_d:
            min_d = d
            c = e
    return c

if __name__ == "__main__":

    arr = np.random.randint(100, size=(100, 1, 2))[:, 0, :]

    m = 5
    r = np.random.randint(arr.__len__() - 1)
    k_arr = np.array([arr[r]])
    cla_arr = [[]]
    for i in range(m - 1):
        k = farthest(k_arr, arr)
        k_arr = np.concatenate([k_arr, np.array([k])])

    n = 20
    cla_temp = cla_arr
    for i in range(n):
        for e in arr:
            ki = 0
            min_d = distance(e, k_arr[ki])
            for j in range(1, k_arr.__len__()):
                if distance(e, k_arr[j]) < min_d:
                    min_d = distance(e, k_arr[j])
                    ki = j

        for k in range(k_arr.__len__()):
            if n - 1 == i:
            k_arr[k] = means(cla_temp[k])
            cla_temp[k] = []

    col = ['HotPink', 'Aqua', 'Chartreuse', 'yellow', 'LightSalmon']
    for i in range(m):
        plt.scatter(k_arr[i][0], k_arr[i][1], linewidth=10, color=col[i])
        plt.scatter([e[0] for e in cla_temp[i]], [e[1] for e in cla_temp[i]], color=col[i])

Tags: Machine Learning

Posted by Gordicron on Tue, 19 Apr 2022 03:25:09 +0930