Hello everyone, today I'd like to share how the YOLOv4 loss function is constructed. Its composition is similar to that of YOLOv3, except that YOLOv4 uses the CIoU loss as the localization loss for the target bounding box. I strongly recommend reading the following articles first:

Prediction box decoding and prior box (anchor) adjustment: https://blog.csdn.net/dgvv4/article/details/124076352

Prediction box localization loss and the IoU variants: https://blog.csdn.net/dgvv4/article/details/124039111

# 1. Introduction to the loss function

## 1.1 Positive and negative samples of prediction boxes

The prediction boxes generated by the network fall into three groups: positive samples, negative samples, and ignored boxes.

Positive sample: responsible for predicting a target object. The center point of the object's ground-truth label box falls in one grid cell, and that cell generates three prior boxes (anchors). The prior box with the largest IoU with the label box is the one responsible for the object, and its prediction is trained to fit the label box ever more closely. The other grid cells and prior boxes are not responsible for this object. In other words, among the generated prediction boxes, the one with the largest IoU with the label box is taken as the positive sample.

Negative sample: image background. Compute the IoU between each prediction box and every ground-truth label box. If the maximum IoU between a box and the label boxes of all objects in the image is below the threshold (generally 0.5), the box is considered to contain no target, is recorded as a negative sample, and its confidence should be 0.

Ignored part: prediction boxes whose IoU with a label box is greater than the 0.5 threshold but is not the largest. They are left out of the loss.

Therefore, only positive and negative samples contribute to the loss in YOLOv3. Positive samples have a coordinate loss, a confidence loss and a category loss, while negative samples have only a confidence loss.
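The three groups can be illustrated with a small sketch. It is a simplification for a single object: `assign_samples` and its `ious` input are hypothetical names, and the list is assumed to hold the IoU of each candidate box with that object's label box.

```python
def assign_samples(ious, ignore_thresh=0.5):
    """Label each candidate box of one object as positive, negative or ignored.

    ious: IoU of every candidate box with the object's label box (hypothetical input).
    """
    best = max(range(len(ious)), key=lambda i: ious[i])  # index of the largest IoU
    labels = []
    for i, iou in enumerate(ious):
        if i == best:
            labels.append("positive")   # responsible for the object
        elif iou < ignore_thresh:
            labels.append("negative")   # treated as background
        else:
            labels.append("ignored")    # overlaps well but is not the best match
    return labels


print(assign_samples([0.82, 0.61, 0.30]))  # ['positive', 'ignored', 'negative']
```

The second box overlaps the object well (0.61) but is not the best match, so it is neither rewarded nor penalized.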

## 1.2 Loss function formula

The YOLOv3 loss function is made up of the terms listed below.

Each term is a double sum $\sum_{i=0}^{S^2}\sum_{j=0}^{B}$ over the S*S grid cells and the B prediction boxes generated by each cell, that is, over all prediction boxes.
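As a sketch, one common way of writing the YOLOv3 loss that matches terms (1) to (3) below is the following. The weights $\lambda_{coord}$ and $\lambda_{noobj}$, the hat notation for label values, $p_c$ for the predicted confidence and $n$ for the number of categories are assumed notation, not necessarily the author's.

```latex
\begin{aligned}
\mathrm{Loss} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (b_x - \hat{b}_x)^2 + (b_y - \hat{b}_y)^2 + (b_w - \hat{b}_w)^2 + (b_h - \hat{b}_h)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ -\log(p_c) + \sum_{k=1}^{n} \mathrm{BCE}(\hat{c}_k, c_k) \right] \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj}
  \left[ -\log(1 - p_c) \right]
\end{aligned}
```

Here $\mathbb{1}_{ij}^{obj}$ is 1 when box $j$ of cell $i$ is a positive sample and $\mathbb{1}_{ij}^{noobj}$ is 1 when it is a negative sample; ignored boxes appear in neither term.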

(1) Coordinate loss of positive samples: the deviation of the center point coordinates and of the width and height between the positive-sample prediction box and the label box.

(2) Confidence loss of positive samples: the closer the confidence of a positive-sample prediction box is to 1, the closer the loss is to 0.

Category loss of positive samples: for each positive-sample prediction box, a binary cross entropy is computed per category against the label box. The closer the predicted value for the true category is to 1, the smaller the loss.

(3) Confidence loss of negative samples: negative samples are the background of the picture. The closer their confidence is to 0, the smaller the loss.
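The behaviour of the confidence terms can be checked numerically. The sketch below uses a plain binary cross entropy on a single predicted probability; the helper name `binary_cross_entropy` and the clipping constant `eps` are assumptions made for illustration.

```python
import math


def binary_cross_entropy(target, prob, eps=1e-7):
    """BCE for one prediction; `prob` is the predicted confidence in (0, 1)."""
    prob = min(max(prob, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


# Positive sample (target 1): higher confidence gives a smaller loss
pos_losses = [binary_cross_entropy(1.0, p) for p in (0.5, 0.9, 0.99)]
# Negative sample (target 0): lower confidence gives a smaller loss
neg_losses = [binary_cross_entropy(0.0, p) for p in (0.5, 0.1, 0.01)]
print(pos_losses)
print(neg_losses)
```

Both lists decrease from left to right, which is the behaviour described in terms (2) and (3).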

# 2. Code display

I covered the theory of IoU (intersection over union) and the CIoU loss in detail in the previous article. Here is a brief review; if anything is unclear, please see the links above.

## 2.1 IoU (intersection over union)

IoU is the ratio of the intersection to the union of the prediction box and the ground-truth box. Regardless of the scale of the bounding boxes, IoU always lies between 0 and 1, so it reflects well how closely the prediction box matches the ground-truth box. The higher the IoU, the greater the overlap between the two boxes. When IoU is 0 the two boxes do not overlap at all; when IoU is 1 they overlap completely.
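A worked example with two single boxes may help before the batched version. This is a minimal sketch in plain Python; the function name `iou_xywh` is made up for illustration, and boxes are assumed to be given as (center x, center y, width, height).

```python
def iou_xywh(box1, box2):
    """IoU of two boxes given as (center_x, center_y, width, height)."""
    # Convert center/size to corner coordinates
    x1_min, x1_max = box1[0] - box1[2] / 2, box1[0] + box1[2] / 2
    y1_min, y1_max = box1[1] - box1[3] / 2, box1[1] + box1[3] / 2
    x2_min, x2_max = box2[0] - box2[2] / 2, box2[0] + box2[2] / 2
    y2_min, y2_max = box2[1] - box2[3] / 2, box2[1] + box2[3] / 2
    # Width and height of the overlap, clamped at 0 when the boxes are separated
    inter_w = max(min(x1_max, x2_max) - max(x1_min, x2_min), 0.0)
    inter_h = max(min(y1_max, y2_max) - max(y1_min, y2_min), 0.0)
    inter = inter_w * inter_h
    union = box1[2] * box1[3] + box2[2] * box2[3] - inter
    return inter / union


# Two 2x2 boxes shifted by 1 along x: overlap 1*2 = 2, union 4 + 4 - 2 = 6
print(iou_xywh((2, 2, 2, 2), (3, 2, 2, 2)))
```

The shifted pair gives 2/6, identical boxes give 1, and boxes that do not touch give 0.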

The code is as follows, saved as iou.py.

Input parameters: box1 holds the prediction boxes, shape=[b, w, h, num_anchor, 4], where the 4 values are the center point coordinates, width and height of each prediction box. box2 holds the label boxes, shape=[b, w, h, num_anchor, 4], with the same layout.

Output: the IoU, shape=[b, w, h, num_anchor, 1]. Each grid cell generates num_anchor prediction boxes, and each prediction box gets one IoU value.

```python
import tensorflow as tf


# (1) Define the IoU calculation
def IOU(box1, box2):
    # Prediction boxes: center coordinates and width/height for the whole batch
    box1_xy = box1[..., :2]
    box1_wh = box1[..., 2:4]
    # Half width and height. True division is needed here: with `//`,
    # normalized sizes below 2 would be floored to 0.
    box1_wh_half = box1_wh / 2.0
    box1_min = box1_xy - box1_wh_half  # upper-left corner
    box1_max = box1_xy + box1_wh_half  # lower-right corner

    # Ground-truth boxes: corners computed in the same way
    box2_xy = box2[..., :2]
    box2_wh = box2[..., 2:4]
    box2_wh_half = box2_wh / 2.0
    box2_min = box2_xy - box2_wh_half
    box2_max = box2_xy + box2_wh_half

    # Area of the prediction boxes
    box1_area = box1_wh[..., 0] * box1_wh[..., 1]
    # Area of the ground-truth boxes
    box2_area = box2_wh[..., 0] * box2_wh[..., 1]

    # Corners of the intersection region
    intersect_min = tf.maximum(box1_min, box2_min)  # upper-left corner
    intersect_max = tf.minimum(box1_max, box2_max)  # lower-right corner
    # Width and height of the intersection; 0 if the two boxes are separated
    intersect_wh = tf.maximum(intersect_max - intersect_min, 0)

    # Intersection and union areas
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    union_area = box1_area + box2_area - intersect_area

    # IoU; a small constant keeps the denominator from being 0
    iou = intersect_area / (union_area + tf.keras.backend.epsilon())

    # [b, w, h, num_anchor] ==> [b, w, h, num_anchor, 1]
    iou = tf.expand_dims(iou, axis=-1)

    return iou
```

## 2.2 CIoU

CIoU extends the overlap-area term of IoU with the distance between the center points and the aspect ratio.

The aspect-ratio term is given below, where $\alpha$ is a trade-off factor and $v$ measures the consistency of the aspect ratios.

$$v = \frac{4}{\pi^2}\left(\arctan\frac{w_{gt}}{h_{gt}} - \arctan\frac{w}{h}\right)^2, \qquad \alpha = \frac{v}{(1 - IoU) + v}$$

The CIoU loss is computed as follows: $b$ is the center point of the prediction box, $b_{gt}$ is the center point of the ground-truth box, $\rho$ is the Euclidean distance between the two center points, and $c$ is the diagonal length of the smallest rectangle enclosing the two bounding boxes.

$$L_{CIoU} = 1 - IoU + \frac{\rho^2(b, b_{gt})}{c^2} + \alpha v$$
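To make the three ingredients concrete, here is a minimal single-box sketch in plain Python. The name `ciou_xywh` and the stabilizing constant `eps` are assumptions for illustration; boxes are taken as (center x, center y, width, height).

```python
import math


def ciou_xywh(box, box_gt, eps=1e-7):
    """CIoU of two boxes given as (center_x, center_y, width, height)."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box, box_gt
    # Plain IoU from the overlap of the two boxes
    inter_w = max(min(x1 + w1 / 2, x2 + w2 / 2) - max(x1 - w1 / 2, x2 - w2 / 2), 0.0)
    inter_h = max(min(y1 + h1 / 2, y2 + h2 / 2) - max(y1 - h1 / 2, y2 - h2 / 2), 0.0)
    inter = inter_w * inter_h
    iou = inter / (w1 * h1 + w2 * h2 - inter + eps)
    # Squared center distance and squared diagonal of the enclosing rectangle
    rho2 = (x1 - x2) ** 2 + (y1 - y2) ** 2
    enc_w = max(x1 + w1 / 2, x2 + w2 / 2) - min(x1 - w1 / 2, x2 - w2 / 2)
    enc_h = max(y1 + h1 / 2, y2 + h2 / 2) - min(y1 - h1 / 2, y2 - h2 / 2)
    c2 = enc_w ** 2 + enc_h ** 2
    # Aspect-ratio consistency v and trade-off factor alpha
    v = 4 / math.pi ** 2 * (math.atan(w2 / h2) - math.atan(w1 / h1)) ** 2
    alpha = v / (1 - iou + v + eps)
    return iou - rho2 / (c2 + eps) - alpha * v


same = ciou_xywh((2, 2, 2, 2), (2, 2, 2, 2))
shifted = ciou_xywh((2, 2, 2, 2), (3, 2, 2, 2))
print(same, shifted, 1 - shifted)  # CIoU of identical boxes, of shifted boxes, and the loss
```

For the shifted pair the IoU is 1/3, the center-distance penalty is 1/13 (squared distance 1 over squared diagonal 3² + 2²), and v is 0 because both boxes are square, so the CIoU is lower than the plain IoU.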

The code is as follows, saved as ciou.py.

```python
'''
Parameters
    box1: prediction boxes, [b, w, h, num_anchor, 4],
          where the 4 values are the center xy and the width/height wh
    box2: ground-truth boxes, [b, w, h, num_anchor, 4], same layout
Returns
    ciou: CIoU of every prediction box, [b, w, h, num_anchor, 1]
'''
import math

import tensorflow as tf


# (1) Define the CIoU calculation
def CIOU(box1, box2):
    eps = tf.keras.backend.epsilon()

    # ① Calculate the IoU first
    # Prediction boxes
    box1_xy = box1[..., 0:2]           # center coordinates
    box1_wh = box1[..., 2:4]           # width and height
    box1_wh_half = box1_wh / 2.0       # half width and height (true division, not `//`)
    box1_min = box1_xy - box1_wh_half  # upper-left corner
    box1_max = box1_xy + box1_wh_half  # lower-right corner
    box1_area = box1_wh[..., 0] * box1_wh[..., 1]

    # Ground-truth boxes
    box2_xy = box2[..., 0:2]
    box2_wh = box2[..., 2:4]
    box2_wh_half = box2_wh / 2.0
    box2_min = box2_xy - box2_wh_half
    box2_max = box2_xy + box2_wh_half
    box2_area = box2_wh[..., 0] * box2_wh[..., 1]

    # Upper-left and lower-right corners of the intersection
    intersect_min = tf.maximum(box1_min, box2_min)
    intersect_max = tf.minimum(box1_max, box2_max)
    # Width and height of the intersection, clamped at 0 for separated boxes
    # (without the clamp, two negative extents would multiply to a positive area)
    intersect_wh = tf.maximum(intersect_max - intersect_min, 0)
    intersect_area = intersect_wh[..., 0] * intersect_wh[..., 1]
    union_area = box1_area + box2_area - intersect_area
    # IoU; a small constant keeps the denominator from being 0
    iou = intersect_area / (union_area + eps)

    # ② Smallest enclosing rectangle containing both boxes
    enclose_min = tf.minimum(box1_min, box2_min)  # upper-left corner
    enclose_max = tf.maximum(box1_max, box2_max)  # lower-right corner
    # Squared diagonal of the enclosing rectangle
    enclose_distance = tf.reduce_sum(tf.square(enclose_max - enclose_min), axis=-1)
    # Squared distance between the two center points
    center_distance = tf.reduce_sum(tf.square(box1_xy - box2_xy), axis=-1)

    # ③ Aspect-ratio term
    # tf.math.atan2(w, h) equals arctan(w / h) for positive w and h
    v = 4 * tf.square(
        tf.math.atan2(box1_wh[..., 0], box1_wh[..., 1])
        - tf.math.atan2(box2_wh[..., 0], box2_wh[..., 1])
    ) / (math.pi * math.pi)
    # eps avoids 0/0 when the boxes coincide (iou = 1 and v = 0)
    alpha = v / (1.0 - iou + v + eps)

    # CIoU
    ciou = iou - center_distance / (enclose_distance + eps) - alpha * v

    # [b, w, h, num_anchor] ==> [b, w, h, num_anchor, 1]
    ciou = tf.expand_dims(ciou, axis=-1)

    return ciou
```

## 2.3 Prediction box decoding: fine-tuning the prior box

Take the adjustment of one prior box of one grid cell as an example. In the figure, the dashed box is the prior box with the largest IoU with the object's ground-truth label box, and its width and height are (pw, ph). The blue box is the prediction box produced by fine-tuning the prior box.

(cx, cy) is the upper-left coordinate (normalized coordinate) of the grid cell that contains the center point of the prior box. Since the coordinate offsets (tx, ty) can be any number from negative infinity to positive infinity, a sigmoid function is applied to them to prevent excessive adjustment. It limits the offset to 0-1 and so keeps the center point of the prediction box inside its grid cell. The width and height offsets (tw, th) are the normalized width and height adjustment values, applied as an exponential scale on the prior box size. The final prediction box has center (bx, by) and width and height (bw, bh):

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}$$
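The decoding of a single prior box can be sketched in plain Python. The function `decode_box` and its parameter names are hypothetical; the sketch assumes (cx, cy) is given in grid units, the prior size in input pixels, and that the result is normalized to 0-1 by the grid size and the input size.

```python
import math


def decode_box(t, cell, prior_wh, grid_size, input_size):
    """Decode raw outputs (tx, ty, tw, th) for one prior box into a normalized box."""
    tx, ty, tw, th = t
    cx, cy = cell        # upper-left corner of the grid cell, in grid units
    pw, ph = prior_wh    # prior box width and height, in input pixels

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    bx = (sigmoid(tx) + cx) / grid_size   # the center stays inside its cell
    by = (sigmoid(ty) + cy) / grid_size
    bw = pw * math.exp(tw) / input_size   # exponential scaling of the prior size
    bh = ph * math.exp(th) / input_size
    return bx, by, bw, bh


# Zero offsets: the center lands in the middle of cell (6, 6) and the size equals the prior
box = decode_box((0.0, 0.0, 0.0, 0.0), (6, 6), (142, 110), grid_size=13, input_size=416)
print(box)
```

However large tx and ty become, the sigmoid keeps the decoded center between the left and right (top and bottom) edges of cell (6, 6).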

The code is as follows, saved as anchors.py.

```python
import tensorflow as tf


# (1) Decode the output of one network output layer
def anchors_decode(feats, anchors, num_classes, input_shape):
    '''
    feats: output of one feature layer, e.g. shape=[b, 13, 13, 3*(5+num_classes)]
    anchors: the three prior boxes of each feature point of this layer, shape [3, 2]
    num_classes: number of categories
    input_shape: height and width of the network input image, [416, 416]
    '''
    # Number of prior boxes per grid cell = 3
    num_anchors = len(anchors)
    # Height and width of the feature grid, [h, w] = [13, 13]
    grid_shape = feats.shape[1:3]

    # (1) Coordinates of every grid cell
    # x coordinates of the grid cells: [13] ==> [1,13,1,1]
    grid_x = tf.reshape(tf.range(0, grid_shape[1]), shape=[1, -1, 1, 1])
    # Repeat along the y and anchor dimensions: [1,13,1,1] ==> [13,13,3,1]
    grid_x = tf.tile(grid_x, [grid_shape[0], 1, num_anchors, 1])

    # y coordinates in the same way: [13] ==> [13,1,1,1]
    grid_y = tf.reshape(tf.range(0, grid_shape[0]), shape=[-1, 1, 1, 1])
    # [13,1,1,1] ==> [13,13,3,1]
    grid_y = tf.tile(grid_y, [1, grid_shape[1], num_anchors, 1])

    # Concatenate on the last dimension: [13,13,3,2], the (x, y) index of every
    # grid cell, each running from 0 to 12
    grid = tf.concat([grid_x, grid_y], axis=-1)
    grid = tf.cast(grid, tf.float32)

    # (2) Prior boxes: 13*13 grid cells, 3 prior boxes per cell, (w, h) per prior box
    # [3,2] ==> [1,1,3,2]
    anchors_tensor = tf.reshape(anchors, shape=[1, 1, num_anchors, 2])
    # [1,1,3,2] ==> [13,13,3,2]
    anchors_tensor = tf.tile(anchors_tensor, [grid_shape[0], grid_shape[1], 1, 1])
    anchors_tensor = tf.cast(anchors_tensor, tf.float32)

    # (3) Reshape the network output feature map
    # [b, 13, 13, 3*(5+num_classes)] ==> [b, 13, 13, 3, (5+num_classes)]
    # 13*13 grid cells, 3 prior boxes per cell, (5+num_classes) values per prior box.
    # The 5 values are the center (x, y), the width and height (w, h) and the
    # confidence c; num_classes are the conditional class probabilities
    # (20 for the VOC dataset).
    feats = tf.reshape(
        feats, shape=[-1, grid_shape[0], grid_shape[1], num_anchors, 5 + num_classes])

    # (4) Adjust the center coordinates and the width and height of the prior boxes
    # The sigmoid keeps each predicted center inside its own grid cell
    anchor_xy = tf.nn.sigmoid(feats[..., :2])
    box_xy = anchor_xy + grid  # center in grid units
    # Normalize the centers from the range (0, 13) to 0-1, dividing by (w, h)
    box_xy = box_xy / tf.cast([grid_shape[1], grid_shape[0]], dtype=feats.dtype)

    # The width and height offsets scale the prior boxes exponentially
    anchors_wh = tf.exp(feats[..., 2:4])
    box_wh = anchors_wh * anchors_tensor  # width and height in input pixels
    # Normalize the width and height from 0-416 to 0-1, dividing by (w, h)
    box_wh = box_wh / tf.cast(input_shape[::-1], dtype=feats.dtype)

    # Return the reshaped raw output and the decoded boxes
    return feats, box_xy, box_wh
```

## 2.4 YOLOv4 loss function

The code is as follows. I have annotated it throughout, and the confidence loss, classification loss and localization loss are computed according to the formula. Pay close attention to which samples are positive and which are negative. The comments take the third effective output layer of the network, [b,13,13,3*(5+num_classes)], as an example. If you find any error in the code, please point it out in the comment section.

```python
import numpy as np
import tensorflow as tf

from iou import IOU                 # IoU of two sets of boxes
from ciou import CIOU               # CIoU of two sets of boxes
from anchors import anchors_decode  # decode prior boxes into normalized xy and wh


# ---------------------------------------------------------------------- #
# Loss function
# features: list, [outputs_1, outputs_2, outputs_3, y_true1, y_true2, y_true3]
#   outputs: the three effective feature layers output by the network:
#     (b,13,13,num_anchors*(5+num_classes)) large targets,
#     (b,26,26,num_anchors*(5+num_classes)) medium targets,
#     (b,52,52,num_anchors*(5+num_classes)) small targets
#   y_true: the label boxes matching each feature layer:
#     (b,13,13,num_anchors,5+num_classes) large targets,
#     (b,26,26,num_anchors,5+num_classes) medium targets,
#     (b,52,52,num_anchors,5+num_classes) small targets
# input_shape: size of the network input image, (h, w) = (416, 416)
# anchors: sizes of the 9 prior boxes
# anchors_mask: for each feature layer, the indices of its three prior boxes
#   in the prior box list
# num_classes: number of categories
# ignore_thresh: if the maximum IoU of a box with all objects in the image is
#   below this threshold (generally 0.5), the box is considered to contain no
#   target and is recorded as a negative sample with confidence 0
# The remaining parameters are the weights in the loss formula
# ---------------------------------------------------------------------- #
def yolo_loss(features, input_shape, anchors, anchors_mask, num_classes,
              ignore_thresh=0.5, box_ratio=0.05, balance=[0.4, 1.0, 4.0],
              obj_ratio=1.0, cls_ratio=0.125):

    # Split the inputs into the three output layers and the matching labels
    outputs = features[:3]
    y_true = features[3:]

    # Height and width of the input image as a tensor, (h, w) = (416, 416)
    input_shape = tf.cast(input_shape, outputs[0].dtype)

    # Initialize the loss
    loss = 0

    # Traverse the three effective feature layers
    for layer in range(3):
        # Confidence c of every label box of this layer: c=1 with an object,
        # c=0 without. The 5 in the last dimension stands for center (x, y),
        # width and height (w, h) and confidence c.
        # [b,13,13,num_anchors,5+num_classes] ==> [b,13,13,num_anchors,1]
        object_mask = y_true[layer][..., 4:5]
        # Class probabilities of the objects in the label boxes
        true_class_probs = y_true[layer][..., 5:]

        # Decode the network output of this layer.
        # anchors[anchors_mask[layer]] holds the (w, h) of the layer's three prior boxes
        # raw_pred: reshaped output, [b,13,13,num_anchors*(5+num_classes)]
        #           ==> [b,13,13,num_anchors,(5+num_classes)]
        # box_xy: decoded center coordinates, normalized
        # box_wh: decoded width and height, normalized
        raw_pred, box_xy, box_wh = anchors_decode(
            outputs[layer], anchors[anchors_mask[layer]], num_classes, input_shape)

        # Position of every prediction box:
        # box_xy [b,13,13,3,2] + box_wh [b,13,13,3,2] ==> pred_box [b,13,13,3,4]
        pred_box = tf.concat([box_xy, box_wh], axis=-1)

        # The loss involves positive samples, negative samples and ignored boxes.
        # Positive sample: responsible for predicting a target. Find the grid
        #   cell containing the target's center, compute the IoU of the prior
        #   boxes with the target, and match the prior box with the largest
        #   IoU to the target. The other prior boxes are not responsible.
        # Negative sample: image background. If the maximum IoU of a box with
        #   all target boxes is below ignore_thresh, it contains no target
        #   and its confidence should be 0.
        # Ignored: IoU exceeds ignore_thresh but is not the largest. Such a box
        #   contains part of a target and cannot simply be treated as
        #   background, so it does not take part in the loss.

        # Position (x, y, w, h) of all label boxes, [b,13,13,3,4]
        true_box = y_true[layer][..., 0:4]
        # IoU of prediction boxes and label boxes, [b,13,13,3,1]
        iou = IOU(pred_box, true_box)
        # Drop the last dimension: [b,13,13,3,1] ==> [b,13,13,3]
        best_iou = tf.reduce_max(iou, axis=-1)

        # 1.0 where the IoU is below the threshold (the box may count as a
        # negative sample), 0.0 where the box must be ignored
        ignore_mask = tf.cast(best_iou < ignore_thresh, dtype=true_box.dtype)
        # [b,13,13,3] ==> [b,13,13,3,1]
        ignore_mask = tf.expand_dims(ignore_mask, axis=-1)

        # CIoU loss, i.e. the bounding box localization loss
        # CIoU of prediction boxes and label boxes, [b,13,13,3,1]
        ciou = CIOU(pred_box, true_box)
        # Only positive samples contribute: label confidence * (1 - CIoU)
        ciou_loss = object_mask * (1 - ciou)
        location_loss = tf.reduce_sum(ciou_loss)

        # Confidence loss, [b,13,13,3,1]
        # raw_pred[..., 4:5] is the predicted confidence (as a logit)
        # (1) where a label box exists, cross entropy between 1 and the prediction
        # (2) where none exists, cross entropy between 0 and the prediction,
        #     kept only where best_iou < ignore_thresh
        bce = tf.keras.backend.binary_crossentropy(
            object_mask, raw_pred[..., 4:5], from_logits=True)
        confidence_loss = object_mask * bce + (1 - object_mask) * bce * ignore_mask

        # Category loss, [b,13,13,3,num_classes]: where a label box exists,
        # cross entropy between the label and the predicted class probabilities
        class_loss = object_mask * tf.keras.backend.binary_crossentropy(
            true_class_probs, raw_pred[..., 5:], from_logits=True)

        # Number of positive samples (label confidence 1) and of negative
        # samples (label confidence 0 and best_iou < ignore_thresh); at least 1
        num_positive = tf.maximum(tf.reduce_sum(tf.cast(object_mask, tf.float32)), 1.0)
        num_negative = tf.maximum(
            tf.reduce_sum(tf.cast((1 - object_mask) * ignore_mask, tf.float32)), 1.0)

        # Total loss = localization loss + confidence loss + category loss
        location_loss = location_loss * box_ratio / num_positive
        confidence_loss = (tf.reduce_sum(confidence_loss) * balance[layer] * obj_ratio
                           / (num_positive + num_negative))
        class_loss = tf.reduce_sum(class_loss) * cls_ratio / num_positive / num_classes

        loss = loss + location_loss + confidence_loss + class_loss

    # Return the summed loss of the three feature layers
    return loss
```

Construct the network output layers and ground-truth labels to verify the code:

```python
if __name__ == '__main__':

    # Three effective feature layers of the network output
    feat1 = tf.fill([4, 16, 16, 3*25], 50.0)
    feat2 = tf.fill([4, 16, 16, 3*25], 50.0)
    feat3 = tf.fill([4, 16, 16, 3*25], 50.0)

    # Label boxes. Every entry is 40.0, so this only checks that the shapes
    # line up; real labels use 0/1 confidence and one-hot classes, so the
    # printed value is not meaningful as a loss.
    true1 = tf.fill([4, 16, 16, 3, 25], 40.0)
    true2 = tf.fill([4, 16, 16, 3, 25], 40.0)
    true3 = tf.fill([4, 16, 16, 3, 25], 40.0)

    # Combined input features
    features = [feat1, feat2, feat3, true1, true2, true3]

    # Size of the input image, [h, w]
    input_shape = [416, 416]

    # Prior boxes
    anchors = np.array([[12, 16], [19, 36], [40, 28], [36, 75], [76, 55],
                        [72, 146], [142, 110], [192, 243], [459, 401]])

    # Indices of the three prior boxes of each feature layer
    anchors_mask = [[6, 7, 8], [3, 4, 5], [0, 1, 2]]

    # Calculate the loss
    loss = yolo_loss(features, input_shape, anchors, anchors_mask, num_classes=20)

    # The original post reported tf.Tensor(-994.3477, shape=(), dtype=float32)
    print(loss)
```