The full name of the K-means algorithm is the K-means clustering algorithm.
Clustering
Clustering: given a large data set with unknown labels, divide it into multiple clusters according to the internal similarity of the data, so that similarity within a cluster is as high as possible and similarity between clusters is as low as possible.
Why clustering: customer segmentation is a method of discovering user characteristics. It groups customers based on their data, giving an overview of the customer base that can be translated directly into marketing strategies for different customer groups.
Application fields:
Economic field:
- Help market analysts find different customer groups from the customer database
- Cluster residential areas to determine ATM locations
- Analyze stock market sectors and identify the leading stocks of the most dynamic sectors
- Enterprise credit rating classification
Biological field:
- Derive classifications of plants and animals
- Classify genes to gain insight into populations
Other:
- As a preprocessing step for other algorithms, to obtain the distribution of the data
Principle:
Characteristics of a "class" in cluster analysis:
- The classes are not given in advance but are formed according to the similarity and distance between data points;
- The number and structure of clusters are not assumed in advance
The purpose of clustering methods is to find:
- Potential natural grouping structure
- Relationship of interest
Overview of K-means algorithm
What is K?
K is the number of classes in the clustering algorithm.
What is means?
means refers to using the mean (average) of the points in each cluster as its center.
Summary: K-means is a hard clustering algorithm that uses cluster means to divide the data into k categories!
On continuous attributes: K-means clusters continuous attributes well; it is not suitable for handling discrete attributes.
K-means evaluation criteria
A good clustering method should be able to produce high-quality clustering results - clusters. These clusters should have the following two characteristics:
- High intra-cluster similarity
- Low inter-cluster similarity
Basic idea: K-means belongs to the partitioning methods. It divides the data set into different categories (or clusters) through iteration, optimizing a criterion function that evaluates clustering quality, so that each cluster is compact internally and independent of the others.
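The criterion function is not spelled out above; the standard choice for K-means (given here in the usual textbook form) is the within-cluster sum of squared errors (SSE), which neither the assignment step nor the update step ever increases:

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$$

where $C_i$ is the $i$-th cluster and $\mu_i$ is the mean of the points in $C_i$; a smaller $E$ means more compact clusters.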
Basic flow of the algorithm
- Randomly select k points as the initial cluster centers; each center represents one cluster
- Compute the distance from every point to each of the k centers and assign each point to the nearest cluster
- Adjust the cluster centers, moving each one to the geometric center (i.e. mean) of its cluster
- Repeat steps 2 and 3 until the cluster centers no longer move and the algorithm converges
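A minimal NumPy sketch of these four steps (an illustrative rewrite, not the article's implementation; the function name kmeans and the seed parameter are my own, and the full demo with plotting appears at the end of this article):

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Plain K-means over an (n, d) array of points."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k distinct points as the initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each center to the mean of its cluster
        # (keep the old center if a cluster happens to be empty).
        new_centers = np.array([points[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Step 4: stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example: labels, centers = kmeans(np.random.rand(200, 2), k=5)
```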
Main factors of the K-means algorithm
Advantages and disadvantages of K-means
Advantages:
- The idea is simple and easy to implement
- The time complexity is nearly linear
- High efficiency and scalability on large data sets
Disadvantages:
- The result depends on the selection of the initial means
- The number of clusters k must be given in advance
- Sensitive to noise and outliers
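In practice the first two weaknesses are usually softened by running several initializations and keeping the best result; a minimal sketch using scikit-learn (an assumption here — the library is not used in the demo below) shows the relevant parameters:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)        # random 2-D data, as in the demo below
km = KMeans(n_clusters=5,         # k still has to be chosen in advance
            init="k-means++",     # spread out the initial centers
            n_init=10,            # try 10 initializations, keep the lowest SSE
            random_state=0).fit(X)
print(km.cluster_centers_)        # final cluster centers
print(km.inertia_)                # final SSE (lower is better)
```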
Python implementation of the two-dimensional K-means algorithm, plotted with matplotlib
The data here is randomly generated; to use your own data, just change how the input is produced.
```python
import numpy as np
import matplotlib.pyplot as plt


def distance(e1, e2):
    """Euclidean distance between two 2-D points."""
    return np.sqrt((e1[0] - e2[0]) ** 2 + (e1[1] - e2[1]) ** 2)


def means(arr):
    """Geometric center (mean point) of a list of 2-D points."""
    return np.array([np.mean([e[0] for e in arr]),
                     np.mean([e[1] for e in arr])])


def farthest(k_arr, arr):
    """Return the point in arr with the largest total distance to the
    current centers, used to spread out the initial centers."""
    f = [0, 0]
    max_d = 0
    for e in arr:
        d = sum(distance(c, e) for c in k_arr)
        if d > max_d:
            max_d = d
            f = e
    return f


if __name__ == "__main__":
    # Random 2-D data; change this line to read your own input.
    arr = np.random.randint(100, size=(100, 2))
    m = 5   # number of clusters k
    n = 20  # number of iterations

    # Initialization: one random point, then repeatedly add the point
    # farthest from the centers chosen so far, until there are m centers.
    # (float dtype so the mean updates are not truncated to integers)
    r = np.random.randint(len(arr))
    k_arr = np.array([arr[r]], dtype=float)
    cla_temp = [[]]
    for i in range(m - 1):
        k = farthest(k_arr, arr)
        k_arr = np.concatenate([k_arr, np.array([k], dtype=float)])
        cla_temp.append([])

    # Iteration: assign every point to the nearest center, then move each
    # center to the mean of its cluster.
    for i in range(n):
        for e in arr:
            ki = 0
            min_d = distance(e, k_arr[ki])
            for j in range(1, len(k_arr)):
                if distance(e, k_arr[j]) < min_d:
                    min_d = distance(e, k_arr[j])
                    ki = j
            cla_temp[ki].append(e)
        if i < n - 1:  # keep the last assignment for plotting
            for k in range(len(k_arr)):
                k_arr[k] = means(cla_temp[k])
                cla_temp[k] = []

    # Plot: thick markers for centers, normal markers for cluster members.
    col = ['HotPink', 'Aqua', 'Chartreuse', 'yellow', 'LightSalmon']
    for i in range(m):
        plt.scatter(k_arr[i][0], k_arr[i][1], linewidth=10, color=col[i])
        plt.scatter([e[0] for e in cla_temp[i]],
                    [e[1] for e in cla_temp[i]], color=col[i])
    plt.show()
```