K-Means clustering algorithm: dimensionality reduction and image vector quantization

1. Overview:

One of the most important applications of K-Means clustering is vector quantization (VQ) on unstructured data such as sound and images. Unstructured data tends to occupy a lot of storage: the files are large and slow to process. We would like to shrink unstructured data, or simplify its structure, while preserving data quality, and vector quantization helps us achieve this. Vector quantization is essentially a dimensionality-reduction technique, but it works differently from other dimensionality-reduction approaches: feature selection reduces dimensions by directly picking the features that contribute most to the model, and PCA reduces dimensions by aggregating information. Vector quantization instead compresses the amount of information carried by the same number of samples: it changes neither the number of features nor the number of samples, only the amount of information the samples carry under those features.

For example, suppose we have a set of 40 samples, each carrying its own pair of values (x1, x2), so there are 40 distinct sets of information. We cluster the data into four groups and find four centroids. Since the sample points in each cluster are very similar to their centroid, the information they carry is approximately equal to the information carried by the centroid of their cluster. We can therefore replace each original sample with the centroid of its cluster. In this way, the 40 distinct values of the 40 samples are reduced to 4: although the sample size is still 40, the 40 samples now take only 4 values, namely the 4 centroids.
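As a minimal sketch of this idea (using synthetic data; the variable names are illustrative), the snippet below clusters 40 random two-dimensional points into 4 groups and replaces every point with the centroid of its cluster:

# A minimal sketch of the 40-sample example above, on synthetic data
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = rng.rand(40, 2)  # 40 samples, each with two features (x1, x2)

kmeans = KMeans(n_clusters=4, random_state=0).fit(X)
X_quantized = kmeans.cluster_centers_[kmeans.labels_]  # replace every sample with its centroid

print(X_quantized.shape)                     # (40, 2): still 40 samples
print(np.unique(X_quantized, axis=0).shape)  # (4, 2): but only 4 distinct values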

Case:

1. Picture exploration:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one sequence to the nearest point in another
from sklearn.datasets import load_sample_image  # utility for loading sample image data
from sklearn.utils import shuffle  # shuffles the data

# Load the sample picture of the Summer Palace
china = load_sample_image('china.jpg')
print(china)
print(china.dtype)  # view the image's data type
print(china.shape)

# How many different colors does the image contain?
newimage = china.reshape((427 * 640, 3))  # flatten the 427x640 image to one RGB triple per pixel
print(newimage.shape)

result = pd.DataFrame(newimage).drop_duplicates().shape  # de-duplicate the colors to count distinct ones
print(result)

# Visualize the image
plt.figure(figsize=(15, 15))
plt.imshow(china)  # display the image formed by the 3-D array
plt.show()

After exploring the picture, we learn that it contains 96,615 distinct colors. We hope to use K-Means to compress these colors down to 64: that is, cluster the 96,615 colors into 64 groups and replace all of them with the 64 centroids.

For comparison, we also draw a vector-quantized image randomly compressed to 64 colors: we randomly select 64 sample points as the centroids, then compare against the K-Means result above to observe the visual quality and the degree of information loss.

K-Means vector quantization:

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one sequence to the nearest point in another
from sklearn.datasets import load_sample_image  # utility for loading sample image data
from sklearn.utils import shuffle  # shuffles the data

# Load the sample picture of the Summer Palace
china = load_sample_image('china.jpg')

# Data preprocessing
n_clusters = 64
china = np.array(china, dtype=np.float64) / china.max()  # normalize pixel values into [0, 1]
w, h, d = original_shape = tuple(china.shape)  # record the image's dimensions
assert d == 3  # ensure d is 3 (RGB); otherwise raise an error
image_array = np.reshape(china, (w * h, d))  # flatten the image into a (pixels, 3) matrix

# K-Means vector quantization of the data
image_array_sample = shuffle(image_array, random_state=0)[:1000]  # the data is large, so fit the centroids on 1,000 sampled pixels
kmeans = KMeans(n_clusters=n_clusters, random_state=0).fit(image_array_sample)
print(kmeans.cluster_centers_.shape)  # the 64 centroids
# Assign every pixel to its nearest centroid
labels = kmeans.predict(image_array)
print(labels.shape)

# Replace every sample with its centroid
image_kmeans = image_array.copy()

for i in range(w * h):
    image_kmeans[i] = kmeans.cluster_centers_[labels[i]]

# Restore the image structure
image_kmeans = image_kmeans.reshape(w, h, d)
print(image_kmeans.shape)

# Finally, draw the images
# Original image
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Original image (96,615 colors)')
plt.imshow(china)
plt.show()

# After K-Means vector quantization
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Quantized image (64 colors, k-means)')
plt.imshow(image_kmeans)
plt.show()
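As an aside, the per-pixel replacement loop above can be collapsed into a single NumPy fancy-indexing operation. This is an equivalent (and much faster) sketch, assuming kmeans, labels, w, h and d from the code above are still in scope:

# Vectorized alternative to the replacement loop: index the centroid
# array with every pixel's label at once, then restore the image shape
image_kmeans = kmeans.cluster_centers_[labels].reshape(w, h, d)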

Effect:

Random vector quantization:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import pairwise_distances_argmin  # matches each point in one sequence to the nearest point in another
from sklearn.datasets import load_sample_image  # utility for loading sample image data
from sklearn.utils import shuffle  # shuffles the data

# Load the sample picture of the Summer Palace
china = load_sample_image('china.jpg')

# Data preprocessing
n_clusters = 64
china = np.array(china, dtype=np.float64) / china.max()  # normalize pixel values into [0, 1]
w, h, d = original_shape = tuple(china.shape)  # record the image's dimensions
assert d == 3  # ensure d is 3 (RGB); otherwise raise an error

image_array = np.reshape(china, (w * h, d))  # flatten the image into a (pixels, 3) matrix

# Random vector quantization of the data

centroid_random = shuffle(image_array, random_state=0)[:n_clusters]  # pick 64 random pixels as the centroids
print(centroid_random.shape)
labels_random = pairwise_distances_argmin(centroid_random, image_array, axis=0)  # index of the nearest random centroid for each of the 273,280 pixels

# Replace every sample with its nearest random centroid
image_random = image_array.copy()
for i in range(w * h):
    image_random[i] = centroid_random[labels_random[i]]

# Restore the image structure
image_random = image_random.reshape(w, h, d)
print(image_random.shape)

# Display the image
plt.figure(figsize=(10, 10))
plt.axis('off')
plt.title('Quantized image (64 colors, random)')
plt.imshow(image_random)
plt.show()

Effect:

Comparing the images produced by K-Means and random vector quantization, it is clear that K-Means preserves the original image much better than random quantization does.
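To put a number on this comparison, we can compute the mean squared reconstruction error of each quantized image against the original. A sketch, assuming china (the normalized image), image_kmeans and image_random from the code above are still in scope:

# Mean squared error between the original and each quantized image;
# a lower value means less information was lost
mse_kmeans = np.mean((china - image_kmeans) ** 2)
mse_random = np.mean((china - image_random) ** 2)
print('MSE (k-means):', mse_kmeans)
print('MSE (random): ', mse_random)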
