1. Overview:
One of the most important applications of K-Means clustering is vector quantization (VQ) on unstructured data such as sound and images. Unstructured data tends to occupy a lot of storage: the files are large and slow to operate on. We would like to shrink the data, or simplify its structure, while preserving its quality, and vector quantization helps us achieve exactly that. In essence this is a dimensionality-reduction application of K-Means, but one that differs from other dimensionality-reduction approaches: feature selection reduces dimensionality by directly keeping the features that contribute most to the model, and PCA reduces dimensionality by aggregating information, while vector quantization compresses the amount of information carried by the same set of samples. It changes neither the number of features nor the number of samples; it only reduces the amount of information those samples carry under those features.
For example, suppose we have a set of 40 samples, each with two features (x1, x2), giving 40 distinct pairs of values. We cluster this data into four groups and find four centroids. Since the sample points in each cluster are very similar to their centroid, the information they carry is roughly equal to the information carried by that centroid. We can therefore replace every original sample with the centroid of its cluster. In this way the 40 distinct values of the 40 samples are compressed down to 4: the sample size is still 40, but those 40 samples now take only 4 values, namely the 4 centroids, as the sketch below shows.
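A minimal sketch of this idea, using randomly generated data in place of the 40 samples described above (the data and variable names are illustrative, not from the original case):

import numpy as np
from sklearn.cluster import KMeans

# 40 illustrative samples with two features (x1, x2)
rng = np.random.RandomState(42)
X = rng.rand(40, 2)

# Cluster into 4 groups, then replace every sample with its cluster centroid
kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
X_quantized = kmeans.cluster_centers_[kmeans.labels_]

print(X_quantized.shape)                        # still (40, 2): the sample size is unchanged
print(np.unique(X_quantized, axis=0).shape[0])  # but only 4 distinct rows remain

The sample count and feature count are untouched; only the number of distinct values (the information content) shrinks, which is exactly what vector quantization does.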
Case:
1. Picture exploration:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one sequence to its nearest point in another
from sklearn.datasets import load_sample_image  # loads the bundled sample pictures
from sklearn.utils import shuffle  # shuffles the samples

# Load the Summer Palace picture
china = load_sample_image('china.jpg')
print(china)
print(china.dtype)  # data type of the image array (uint8)
print(china.shape)  # (427, 640, 3)

# How many distinct colors does the picture contain?
newimage = china.reshape((427 * 640, 3))
print(newimage.shape)  # (273280, 3): one row per pixel
result = pd.DataFrame(newimage).drop_duplicates().shape  # drop duplicate colors
print(result)  # (96615, 3): 96,615 distinct colors

# Visualize the image
plt.figure(figsize=(15, 15))
plt.imshow(china)  # imshow renders the 3-D array as a picture
plt.show()
From this exploration we learn that the picture contains more than 96,000 distinct colors (96,615 to be exact). We hope to use K-Means to compress them to 64: cluster the colors into 64 groups and replace every color with the centroid of its cluster.
For comparison, we also draw a vector-quantized image randomly compressed to 64 colors: we pick 64 sample points at random to serve as centroids, then compare the result with the K-Means version to see how much image information each approach loses.
K-Means vector quantization:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one sequence to its nearest point in another
from sklearn.datasets import load_sample_image  # loads the bundled sample pictures
from sklearn.utils import shuffle  # shuffles the samples

# Load the Summer Palace picture
china = load_sample_image('china.jpg')

# Data preprocessing
n_clusters = 64
china = np.array(china, dtype=np.float64) / china.max()  # normalize to [0, 1]
w, h, d = original_shape = tuple(china.shape)  # convert the picture into matrix dimensions
assert d == 3  # the image must have 3 color channels, otherwise raise an error
image_array = np.reshape(china, (w * h, d))  # flatten to a (pixels, channels) matrix

# K-Means vector quantization of the data
image_array_sample = shuffle(image_array, random_state=0)[:1000]  # the data is large, so fit the centroids on 1,000 pixels first
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(image_array_sample)
print(kmeans.cluster_centers_.shape)  # (64, 3): the 64 centroids

# Assign all pixels to centroids
labels = kmeans.predict(image_array)
print(labels.shape)  # (273280,)

# Replace every sample with its centroid
image_kmeans = image_array.copy()
for i in range(w * h):
    image_kmeans[i] = kmeans.cluster_centers_[labels[i]]

# Restore the picture structure
image_kmeans = image_kmeans.reshape(w, h, d)
print(image_kmeans.shape)

# Finally, draw the images
# Original
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)
plt.show()

# After K-Means quantization
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Quantized image (64 colors, k-means)')
plt.imshow(image_kmeans)
plt.show()
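As a side note, the pixel-by-pixel loop above can be collapsed into a single NumPy fancy-indexing step. This equivalent sketch assumes the kmeans, labels, w, h, and d variables from the snippet above are already in scope:

# One row of cluster_centers_ per pixel, then restore the (w, h, d) picture shape
image_kmeans = kmeans.cluster_centers_[labels].reshape(w, h, d)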
Effect:
Random vector quantization:
# Imports
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one sequence to its nearest point in another
from sklearn.datasets import load_sample_image  # loads the bundled sample pictures
from sklearn.utils import shuffle  # shuffles the samples

# Load the Summer Palace picture
china = load_sample_image('china.jpg')

# Data preprocessing
n_clusters = 64
china = np.array(china, dtype=np.float64) / china.max()  # normalize to [0, 1]
w, h, d = original_shape = tuple(china.shape)  # convert the picture into matrix dimensions
assert d == 3  # the image must have 3 color channels, otherwise raise an error
image_array = np.reshape(china, (w * h, d))

# Random vector quantization of the data
centroid_random = shuffle(image_array, random_state=0)[:n_clusters]  # pick 64 pixels at random as centroids
print(centroid_random.shape)  # (64, 3)
labels_random = pairwise_distances_argmin(centroid_random, image_array, axis=0)  # index of the nearest random centroid for each of the 273,280 pixels

# Replace every sample with its random centroid
image_random = image_array.copy()
for i in range(w * h):
    image_random[i] = centroid_random[labels_random[i]]

# Restore the picture structure
image_random = image_random.reshape(w, h, d)
print(image_random.shape)

# Display the picture
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Quantized image (64 colors, random)')
plt.imshow(image_random)
plt.show()
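The axis=0 argument in pairwise_distances_argmin can be confusing: with axis=0 the function returns, for each row of the second array, the index of the nearest row in the first array, which is why the call above yields one centroid index per pixel. A quick illustration on toy arrays (the values here are made up for demonstration):

import numpy as np
from sklearn.metrics import pairwise_distances_argmin

centroids = np.array([[0.0, 0.0], [1.0, 1.0]])            # 2 "centroids"
points = np.array([[0.1, 0.0], [0.9, 1.0], [0.2, 0.1]])   # 3 "pixels"

# axis=0: one index per row of the second argument (the pixels)
print(pairwise_distances_argmin(centroids, points, axis=0))  # [0 1 0]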
Effect:
Comparing the pictures after K-Means and random vector quantization, it is clear that K-Means preserves the original image much better than random centroid selection: its centroids are fitted to the data rather than drawn arbitrarily from it.
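To back up the visual comparison with a number, one can compute the mean squared reconstruction error of each quantized image against the original. A minimal sketch, assuming the normalized china array and the image_kmeans and image_random results from the two snippets above are in scope; the expectation (not a measured result) is that the K-Means error comes out lower:

import numpy as np

mse_kmeans = np.mean((china - image_kmeans) ** 2)
mse_random = np.mean((china - image_random) ** 2)
print(f'MSE (k-means): {mse_kmeans:.6f}')
print(f'MSE (random):  {mse_random:.6f}')  # expected to be larger than the k-means error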