Happy Learning AI Series - Computer Vision (4. Extra Article): What Is a "Convolutional Neural Network"?

This series is a quick-start AI series curated by "MATRIX. The core of the matrix". It is characterized by concise content and fast learning. Prerequisites: readers should know the basics of Python programming and have some grounding in linear algebra and probability theory.

A Convolutional Neural Network (CNN) is a deep learning model commonly used for image classification, object detection, and image segmentation in the field of computer vision. Its core idea is to extract features from raw data through convolution operations, and then pass these features to fully connected layers for classification or regression.

In traditional image classification, features such as edges, textures, and colors have to be designed and extracted by hand. This is difficult, because which features are useful depends on the task and on the designer's judgment. The convolutional layers of a convolutional neural network learn the features directly from the data, which greatly simplifies feature extraction and lets the same architecture adapt to different tasks and datasets.

A convolutional neural network consists of multiple convolutional layers, pooling layers, and fully connected layers. The convolutional layer is the core of the network. It contains multiple convolution kernels, and each kernel extracts one kind of feature from the image. The convolution operation can be understood as sliding the kernel across the image: at each position, the pixels under the kernel are multiplied by the corresponding kernel weights and the products are summed to give one new value. The values from all positions form a new feature map. By stacking multiple convolutional layers, the network obtains increasingly abstract feature maps, which helps it distinguish different objects.


The convolution kernel is a central concept in the convolution operation. It is a learnable filter used to extract features from the input data. The convolution operation can be understood as sliding a small window over the input data, computing a weighted sum of the data inside the window at each position, and emitting that sum as one output value. The weights of this small window are the convolution kernel. Kernels are usually small and square (for example 3x3 or 5x5), although rectangular kernels are also possible, and you can choose the size to suit the task.

The kernel's weights are not set by hand. They are learned during training: the backpropagation algorithm computes how each weight should change, and the weights are updated repeatedly so that the model gradually learns more useful features. Each convolutional layer contains several kernels, and each kernel learns to extract a different feature. The layer therefore looks at its input from several angles at once, which gives a richer feature representation.
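To make the idea of "several kernels, several feature maps" concrete, here is a minimal NumPy sketch. The helper name `conv2d_valid`, the toy image, and the two hand-picked kernels are illustrative assumptions, not part of any library. In a real network the kernel weights would be learned rather than written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out

# Toy 4x4 image: dark left half, bright right half (a single vertical edge).
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Two hand-picked kernels standing in for learned ones.
kernels = {
    "horizontal_diff": np.array([[-1.0, 1.0]]),    # reacts to left-to-right changes
    "vertical_diff": np.array([[-1.0], [1.0]]),    # reacts to top-to-bottom changes
}

# One feature map per kernel.
feature_maps = {name: conv2d_valid(image, k) for name, k in kernels.items()}
for name, fmap in feature_maps.items():
    print(name)
    print(fmap)
```

The first kernel produces a non-zero column exactly where the vertical edge sits, while the second produces only zeros, because nothing changes from top to bottom in this image. Each kernel picks out a different property of the same input.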

Suppose we have a grayscale image of size 5x5 pixels and the pixel values are represented as a matrix:

  1  2  3  4  5
  6  7  8  9 10
 11 12 13 14 15
 16 17 18 19 20
 21 22 23 24 25

We can define a 2x2 convolution kernel (also known as a filter) as follows:

  1  0
  0 -1

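The convolution operation places the kernel over a patch of the image, multiplies each kernel weight by the pixel beneath it, and sums the products. For example, placing the kernel in the top-left corner, so that it covers the pixels 1, 2, 6 and 7, gives: (1 * 1) + (2 * 0) + (6 * 0) + (7 * -1) = -6. (Strictly speaking, this operation, in which the kernel is not flipped, is cross-correlation, but deep learning libraries and texts conventionally call it convolution.)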

To apply the kernel to the whole image, we shift it one pixel to the right and compute the weighted sum again. When a row is finished, we move down one pixel and start again from the left. We repeat this until the kernel has visited every position where it fits entirely inside the image. A 2x2 kernel fits at 4 positions along each side of a 5x5 image, so the result is a new 4x4 feature map:

  -6 -6 -6 -6
  -6 -6 -6 -6
  -6 -6 -6 -6
  -6 -6 -6 -6

Every entry is -6 because this kernel subtracts each pixel's lower-right diagonal neighbor from the pixel itself, and in this image that neighbor is always exactly 6 larger. In other words, the kernel measures how much the intensity changes along the top-left to bottom-right diagonal (it is one of the two kernels of the Roberts cross edge detector). A perfectly smooth gradient like this one gives a constant response. On an image that contains an edge crossing that diagonal, the output would be large in magnitude at the edge and close to zero in flat regions. Which kind of edge a kernel highlights depends on its weights: a kernel such as [-1 1] responds to vertical edges, while the same weights arranged as a column respond to horizontal edges.
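The 5x5 example can be checked numerically. The sketch below assumes NumPy 1.20 or newer (for `sliding_window_view`) and computes the weighted sum for every 2x2 window in one step. Since each pixel's lower-right neighbor is exactly 6 larger, every window evaluates to -6.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# The 5x5 image from the example: values 1..25, row by row.
image = np.arange(1, 26).reshape(5, 5)

# The 2x2 kernel from the example.
kernel = np.array([[1, 0],
                   [0, -1]])

# All 2x2 windows of the image, shape (4, 4, 2, 2).
windows = sliding_window_view(image, kernel.shape)

# Weighted sum of each window with the kernel (no kernel flip, stride 1).
feature_map = np.einsum("ijkl,kl->ij", windows, kernel)
print(feature_map)
# [[-6 -6 -6 -6]
#  [-6 -6 -6 -6]
#  [-6 -6 -6 -6]
#  [-6 -6 -6 -6]]
```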


The pooling layer reduces the spatial size of the feature maps, and with it the amount of computation and the number of parameters needed in later layers. It is usually inserted between convolutional layers. It compresses each small region (e.g. 2x2) of a feature map into a single value, typically the maximum or the average of that region, which reduces computation and memory usage.
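As a small illustration, here is a NumPy sketch of 2x2 pooling. It assumes max pooling with non-overlapping regions (the variant used by `MaxPooling2D` in the Keras example later in this article). The function name `max_pool_2x2` and the sample values are made up for the demonstration.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Keep the largest value of each non-overlapping 2x2 region."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    # Drop a trailing row/column if the size is odd.
    trimmed = feature_map[:h2 * 2, :w2 * 2]
    # Axes 1 and 3 index the position inside each 2x2 block.
    blocks = trimmed.reshape(h2, 2, w2, 2)
    return blocks.max(axis=(1, 3))

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 4],
])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[4 5]
#  [6 7]]
```

The 4x4 map shrinks to 2x2, so any layer that follows has a quarter as many values to process.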

Last comes the fully connected layer, which takes all the pooled feature maps, flattens them into a single vector, and uses that vector for classification or regression. Fully connected layers are the same kind of layer found in traditional neural networks, but because each one has to process a large number of features, convolutional neural networks usually use fewer of them than traditional networks do.
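The flatten-then-classify step can be sketched in a few lines of NumPy. Everything here is hypothetical: the 3x3x4 pooled output, the random weights standing in for trained ones, and the choice of 10 output classes with a softmax.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled output: a 3x3 spatial grid with 4 feature maps (channels).
pooled = rng.random((3, 3, 4))

# Flatten into one feature vector of length 3 * 3 * 4 = 36.
features = pooled.reshape(-1)

# One fully connected layer mapping 36 features to 10 class scores.
# Random weights stand in for trained ones.
weights = rng.normal(scale=0.1, size=(features.size, 10))
bias = np.zeros(10)
logits = features @ weights + bias

# Softmax turns the scores into probabilities that sum to 1.
exp = np.exp(logits - logits.max())
probs = exp / exp.sum()

print(features.shape, probs.shape)
print("predicted class:", int(np.argmax(probs)))
```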

Convolutional neural networks are usually trained with backpropagation combined with gradient descent. The network's predictions are compared with the actual labels to compute an error (the loss), backpropagation computes the gradient of that loss with respect to every network parameter, and a gradient-descent-based optimizer moves each parameter a small step in the direction that reduces the loss. Through many iterations, the network gradually learns better features and parameters, thereby improving the performance of the model.
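The following toy sketch shows this loop learning a single 2x2 kernel. It is a simplification under stated assumptions: the "network" is one linear convolution with no activation, the gradient is derived by hand for this one layer instead of being backpropagated through a deep network, and the data, learning rate, and step count are arbitrary choices for the demonstration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

# Synthetic input image and the kernel that generates the "labels".
image = rng.normal(size=(8, 8))
true_kernel = np.array([[1.0, 0.0],
                        [0.0, -1.0]])

windows = sliding_window_view(image, (2, 2))             # shape (7, 7, 2, 2)
target = np.einsum("ijkl,kl->ij", windows, true_kernel)  # desired feature map

kernel = np.zeros((2, 2))   # start from an uninformed guess
learning_rate = 0.1

for step in range(500):
    prediction = np.einsum("ijkl,kl->ij", windows, kernel)
    error = prediction - target
    # Gradient of the mean squared error with respect to each kernel weight.
    grad = 2.0 * np.einsum("ij,ijkl->kl", error, windows) / error.size
    # Gradient descent: step against the gradient.
    kernel -= learning_rate * grad

print(np.round(kernel, 3))  # should end up close to true_kernel
```

Starting from all zeros, the kernel is pulled toward the weights that generated the targets, purely by repeatedly reducing the prediction error.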

This is equivalent to the process of the network adapting to the data through continuous self-adjustment, thereby improving the network's ability to understand and classify data. Therefore, convolutional neural networks have become one of the most popular deep learning models in many fields such as computer vision, speech recognition, and natural language processing.

A convolutional layer reuses the same small kernel at every image position, so it needs far fewer parameters than a fully connected layer over the same input, and pooling layers shrink the feature maps so that later layers have less to compute. Together they make the network more efficient. At the same time, stacking more convolutional and pooling layers lets the network extract progressively more abstract features, which usually improves its performance.

In addition to convolutional layers and pooling layers, convolutional neural networks have many other layers, such as fully connected layers, batch normalization layers, Dropout layers, and more. The use of these layers can further improve the performance of the network and avoid problems such as overfitting of the network.
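As one example of how such a layer works, here is a minimal NumPy sketch of dropout. It assumes the common "inverted dropout" formulation, in which the surviving activations are scaled up during training so that nothing needs to change at inference time. The function name and arguments are illustrative, not a library API.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations          # inference: pass values through unchanged
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the kept values so the expected activation stays the same.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((2, 8))

dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)                                                  # entries are 0.0 or 2.0
print(dropout(activations, rate=0.5, rng=rng, training=False))  # unchanged
```

Because a different random subset of units is silenced at every training step, the network cannot rely on any single unit, which is why dropout tends to reduce overfitting.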

The convolutional neural network is a very powerful deep learning model that can achieve excellent results in fields such as computer vision, speech recognition, and natural language processing. As deep learning technology continues to develop, convolutional neural networks are likely to play an important role in even more fields.

Let's take an example: use the Keras library to build a simple convolutional neural network and use it to classify handwritten digits.

First, we need to prepare the dataset. Here we use the MNIST handwritten digit dataset, which contains 60,000 training samples and 10,000 testing samples. This dataset can be downloaded and loaded directly in Keras:

from keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

The images in this dataset are all 28x28 pixel grayscale images with pixel values ranging from 0 to 255. We need to convert them into a form the network can handle: scale the pixel values to the range 0 to 1 and reshape the data into a 4-dimensional tensor of shape (number of samples, height, width, number of channels).

train_images = train_images.reshape((60000, 28, 28, 1))
train_images = train_images.astype('float32') / 255

test_images = test_images.reshape((10000, 28, 28, 1))
test_images = test_images.astype('float32') / 255

Next, we build a simple convolutional neural network model. This model consists of three convolutional layers, two pooling layers, a flatten layer, and two densely connected layers.

from keras import layers
from keras import models

model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
# Flatten the 3D feature maps into a 1D vector before the dense layers
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
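It can help to trace how the spatial size changes through the convolution and pooling layers above. The sketch below is plain Python with hypothetical helper names. It assumes the Keras defaults used in the model: no padding and stride 1 for `Conv2D`, and a stride equal to the pool size for `MaxPooling2D`.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output width/height of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_output_size(size, pool):
    """Output width/height of non-overlapping pooling along one dimension."""
    return size // pool

# (layer kind, kernel or pool size) in the order used by the model.
layer_spec = [("conv", 3), ("pool", 2), ("conv", 3), ("pool", 2), ("conv", 3)]

size = 28  # MNIST images are 28x28
sizes = [size]
for kind, k in layer_spec:
    size = conv_output_size(size, k) if kind == "conv" else pool_output_size(size, k)
    sizes.append(size)

print(sizes)  # [28, 26, 13, 11, 5, 3]

# The last convolution has 64 kernels, so flattening its 3x3x64 output
# gives the length of the vector that feeds the first dense layer.
flattened = size * size * 64
print(flattened)  # 576
```

Calling `model.summary()` on the Keras model reports the same shapes layer by layer, which is a convenient way to confirm the arithmetic.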

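Finally, we compile the model and start training. The labels are integer class indices (0 to 9), so we use sparse categorical cross-entropy as the loss.

# 'adam' is a common default optimizer; others such as 'rmsprop' also work
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=5, batch_size=64)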

test_loss, test_acc = model.evaluate(test_images, test_labels)
print('test_acc:', test_acc)

Through training, we can see that the accuracy of the model continues to improve as the number of iterations increases. Finally, we can use the test set to evaluate the model and output the accuracy of the test set.

This is a simple convolutional neural network, yet it does a good job of classifying handwritten digits. In practical applications, we can design more complex convolutional neural networks according to the needs of the specific task.

The following small example trains a model and then uses it to recognize a handwritten digit from an image file:

import tensorflow as tf
import numpy as np
import cv2

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# normalized data
x_train, x_test = x_train / 255.0, x_test / 255.0

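# build model
model = tf.keras.models.Sequential([
    tf.keras.layers.Reshape((28, 28, 1), input_shape=(28, 28)),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),  # flatten feature maps before the dense layers
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

# compile model (labels are integers, so use the sparse cross-entropy loss)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])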

# training model
model.fit(x_train, y_train, epochs=5)

# test model
model.evaluate(x_test, y_test)

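# Read the image and convert it to the 28x28 grayscale format the model expects
img = cv2.imread('test.jpg')
if img is None:
    raise FileNotFoundError("Could not read 'test.jpg'")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(gray, (28, 28), interpolation=cv2.INTER_AREA)
# MNIST digits are light strokes on a dark background; if your image is
# dark ink on light paper, invert it first (e.g. resized = 255 - resized)
normalized = resized / 255.0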

# Make predictions on images
pred = model.predict(normalized.reshape(1, 28, 28, 1))

# print out the predicted result
pred_label = np.argmax(pred[0])
print("The predicted digit is:", pred_label)

# display the image until a key is pressed
cv2.imshow('image', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

Tags: Deep Learning Machine Learning Computer Vision AI Convolutional Neural Networks

Posted by ChrisA on Wed, 05 Apr 2023 21:12:47 +0930