If you spot any problems, please point them out.

This post builds on the previous one, "YOLOv3-SPP series | Training on a Pascal VOC format dataset". Once the code runs end to end, you can step through it in a debugger to understand it. For the YOLOv3-SPP training stage, the most important part is the implementation of positive and negative sample matching, so this post records how YOLOv3-SPP implements it.

In YOLOv3-SPP, positive and negative sample matching and loss calculation come down to two operations inside the train_one_epoch function:

- build_targets: responsible for matching positive and negative samples. Its parameters are predictions, targets and model.
- compute_loss: responsible for loss calculation. Its parameters are also predictions, targets and model.

This is the most critical part of the whole yolov3-spp project. The main idea is to first match positive and negative samples with build_targets and then calculate the loss with compute_loss. build_targets is therefore called inside compute_loss. Both functions are annotated below.

# 1. Positive and negative sample matching

The build_targets function selects the positive samples for each anchor of the three feature maps output by the YOLO layers, that is, it performs positive and negative sample matching. These positive samples are then used for training and loss calculation.

## 1.1 Processing flow diagram

An analysis of the process is given at the end of this section.

## 1.2 build_targets code

The code is as follows:

```python
import torch
import torch.nn as nn


def build_targets(p, targets, model):
    """
    Build targets for compute_loss(), input targets(image, class, x, y, w, h)
    :param p: predictions, a list of the three tensors output by the three YOLO layers
              e.g. [4, 3, 23, 23, 25] [4, 3, 46, 46, 25] [4, 3, 92, 92, 25] (at 736x736 input)
              p[i].shape = [batch_size, anchor_num, grid, grid, xywh + obj + classes]
    :param targets: ground-truth boxes of the batch after data augmentation, e.g. [21, 6]
                    21: num_object
                    6: image index within the batch (0, 1, 2, 3) + class + x + y + w + h
    :param model: the initialized model
    :return: four lists with one entry per YOLO layer
        tbox: [m (number of positive samples), x offset + y offset + w + h]
              The offsets are those of the box center relative to the top-left corner of its grid cell.
              Stores the positive samples of all anchors in the current batch. A positive sample of an
              anchor is a target that this anchor is responsible for predicting. The same target may be
              predicted by several anchors, so usually m > nt.
        indices: [m, b + a + gj + gi], aligned one to one with tbox
              b: which image of the batch the target belongs to
              a: which anchor (index) the target is a positive sample of (which anchor predicts it)
              gj: y coordinate of the top-left corner of the grid cell containing the target center
              gi: x coordinate of the top-left corner of the grid cell containing the target center
        tcls: [m], aligned with tbox, the class of each target
        anch: [m, 2], aligned with tbox, the shape of the anchor each target is a positive sample of
    """
    nt = targets.shape[0]  # number of ground-truth boxes in the current batch
    tcls, tbox, indices, anch = [], [], [], []
    gain = torch.ones(6, device=targets.device)  # normalized to gridspace gain, tensor([1, 1, 1, 1, 1, 1])

    multi_gpu = type(model) in (nn.parallel.DataParallel, nn.parallel.DistributedDataParallel)  # usually False
    for i, j in enumerate(model.yolo_layers):  # [89, 101, 113] -> (i, j) = (0, 89), (1, 101), (2, 113)
        # Anchors of this YOLO layer, shape [3, 2], already scaled to the feature map:
        # the anchors in the cfg file divided by the stride.
        # e.g. [3.6250, 2.8125] * 32 (stride 32) = [116, 90], the anchor size in the cfg file
        anchors = model.module.module_list[j].anchor_vec if multi_gpu else model.module_list[j].anchor_vec

        # gain stores the size of the feature map.
        # At 736x736 input: p[0]=[4,3,23,23,25] p[1]=[4,3,46,46,25] p[2]=[4,3,92,92,25], so
        # gain = tensor([1,1,23,23,23,23]) or tensor([1,1,46,46,46,46]) or tensor([1,1,92,92,92,92])
        gain[2:] = torch.tensor(p[i].shape)[[3, 2, 3, 2]]  # xyxy gain
        na = anchors.shape[0]  # number of anchors, 3
        # [3] -> [3, 1] -> [3, nt]; anchor tensor, same as .repeat_interleave(nt)
        # at.shape = [3, 21]: a row of 21 zeros, a row of ones, a row of twos
        at = torch.arange(na).view(na, 1).repeat(1, nt)

        # Match targets to anchors
        # t = targets * gain projects the normalized box coordinates (divided by the image width and
        # height when the labels were generated) onto the feature map of the current YOLO layer.
        # Broadcasting: targets [21, 6] * gain [6] -> t [21, 6]
        a, t, offsets = [], targets * gain, 0
        if nt:  # if there are targets
            # Compute the width-height IoU between the anchors of this layer and the wh of all
            # ground-truth boxes on the feature map (t[:, 4:6]). Pairs with IoU above
            # model.hyp['iou_t'] = 0.2 are kept as positive samples, the rest are discarded.
            # anchors: [3, 2] (defined relative to 416x416, but an approximate initial wh is
            #          good enough because it is regressed anyway)
            # t[:, 4:6]: [nt, 2], w and h of all targets relative to the current feature map
            # j: [3, nt]
            j = wh_iou(anchors, t[:, 4:6]) > model.hyp['iou_t']  # iou(3,n) = wh_iou(anchors(3,2), gwh(n,2))
            # t.repeat(na, 1, 1): [nt, 6] -> [3, nt, 6]
            # a = tensor[30]: anchor index (0, 1, 2) of each positive sample, over all 4 images
            # t = tensor[30, 6]: image index, class, x, y, w, h (at the current feature map scale)
            # a and t are aligned: a[k] is the anchor that t[k] is a positive sample of.
            # The same target can be a positive sample of several anchors, which is why the
            # number of positive samples can exceed the number of targets (30 > 21).
            a, t = at[j], t.repeat(na, 1, 1)[j]  # keep the anchor/target pairs that matched

        # Define
        # b: index of the image each target belongs to
        # c: class of each target
        # long() equals to(torch.int64); it truncates, which rounds down for these non-negative
        # values. b and c are already whole numbers, so long() only converts float -> int here.
        b, c = t[:, :2].long().T  # image, class
        gxy = t[:, 2:4]  # grid xy: target center on the current feature map
        gwh = t[:, 4:6]  # grid wh: target width and height on the current feature map
        # Top-left corner of the grid cell that contains each target center
        gij = (gxy - offsets).long()
        gi, gj = gij.T  # grid xy indices

        # Append
        indices.append((b, a, gj, gi))  # image index, anchor, grid indices (y, x)
        tbox.append(torch.cat((gxy - gij, gwh), 1))  # xy offset within the grid cell, and wh
        anch.append(anchors[a])  # anchors
        tcls.append(c)  # class
        if c.shape[0]:  # if any targets
            # A class label must be smaller than the number of classes
            assert c.max() < model.nc, 'Model accepts %g classes labeled from 0-%g, however you labelled a class %g. ' \
                                       'See https://github.com/ultralytics/yolov3/wiki/Train-Custom-Data' % (
                                           model.nc, model.nc - 1, c.max())

    return tcls, tbox, indices, anch


# IoU computed from width and height only (boxes aligned at the top-left corner).
# This differs from the ordinary IoU and from GIoU.
def wh_iou(wh1, wh2):
    """
    Width-height IoU between the anchors of a YOLO layer and the wh of all ground-truth boxes on
    the same feature map (t[:, 4:6]). build_targets keeps the pairs above model.hyp['iou_t'] = 0.2
    as positive samples of that layer and discards the rest.
    Args:
        wh1: anchors [3, 2]
        wh2: targets [22, 2]
    Returns:
        IoU of wh1 and wh2, [3, 22]
    """
    # Returns the nxm IoU matrix. wh1 is nx2, wh2 is mx2
    wh1 = wh1[:, None]  # [N, 1, 2]  [3, 1, 2]
    wh2 = wh2[None]  # [1, M, 2]  [1, 22, 2]
    inter = torch.min(wh1, wh2).prod(2)  # [N, M]  [3, 22]
    return inter / (wh1.prod(2) + wh2.prod(2) - inter)  # iou = inter / (area1 + area2 - inter)
```

## 1.3 Brief analysis

In build_targets, the class information, the bounding-box information, the matched anchor information (the width and height of the anchor) and the remaining indices (which anchor was selected, which image of the current batch the target belongs to, and the x, y indices of the corresponding grid cell) are all derived from the targets. The overall flow can also be read from the processing flow diagram above.

The key step is that the box coordinates stored in targets[:, 2:] are normalized to 0 ~ 1, so the shape of the current prediction feature layer p[i] can be used to scale the targets to the grid of that layer. This is the key line: a, t, offsets = [], targets * gain, 0.
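The scaling step can be checked by hand on a single target. The following is a minimal plain-Python sketch, not the project's code: `scale_target` is a hypothetical helper that mimics `targets * gain` for one row, and the label values are made up for illustration.

```python
def scale_target(target, grid_w, grid_h):
    """Project one normalized target (image, class, x, y, w, h) onto a grid.

    Mirrors `targets * gain` with gain = [1, 1, grid_w, grid_h, grid_w, grid_h]:
    image index and class are left alone, the box is scaled to grid units.
    """
    gain = [1, 1, grid_w, grid_h, grid_w, grid_h]
    return [value * g for value, g in zip(target, gain)]


# Made-up label: image 0, class 5, center (0.5, 0.25), size (0.2, 0.4), all normalized.
target = [0, 5, 0.5, 0.25, 0.2, 0.4]
scaled = scale_target(target, grid_w=23, grid_h=23)
print(scaled)  # center and size are now expressed in units of grid cells
```

On the 23x23 layer the center lands at about (11.5, 5.75) and the box is about 4.6 cells wide and 9.2 cells high. The same label scaled with a 46x46 or 92x92 grid gives the targets for the other two layers.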

Iterate over the prediction feature layers. Each layer has three anchors of different sizes, which are set in the cfg file. For the three anchors of the current layer, compute a width-height overlap ratio against the scaled targets (implemented by the wh_iou function) and keep the anchor-target pairs above the threshold. The kept targets are called t, and the list of matching anchor indices is called a.
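To make the matching rule concrete, here is a small plain-Python sketch of the same idea for individual anchor/target pairs. `wh_iou_pair` and `match` are hypothetical names. The first anchor is the [3.625, 2.8125] example from the code comments; the other two anchors and both target sizes are assumed values chosen for illustration.

```python
def wh_iou_pair(wh1, wh2):
    """Width-height IoU of two boxes aligned at their top-left corner."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    return inter / (wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter)


def match(anchors, target_whs, iou_t=0.2):
    """Return (anchor index, target index) pairs whose wh IoU exceeds iou_t.

    The pairs come out anchor by anchor, like at[j] in build_targets.
    """
    return [(a, t)
            for a, anchor_wh in enumerate(anchors)
            for t, target_wh in enumerate(target_whs)
            if wh_iou_pair(anchor_wh, target_wh) > iou_t]


# Anchor sizes in grid units (illustrative values for a stride-32 layer).
anchors = [(3.625, 2.8125), (4.875, 6.1875), (11.65625, 10.1875)]
# Two made-up targets, also in grid units.
target_whs = [(4.6, 9.2), (2.0, 2.0)]

pairs = match(anchors, target_whs)
print(pairs)
```

Target 0 passes the threshold for all three anchors, while target 1 passes only for anchor 0, so two targets produce four positive samples. This is why the number of positive samples m is usually larger than nt.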

From t we get the center xy of each target, t[:, 2:4], and its width and height, t[:, 4:6]. A target is predicted by the anchors of the grid cell that contains its center, so truncating the fractional part of the center gives gij, the top-left corner of the grid cell where the matched target lies. The matched anchor at that grid cell makes the prediction, and the fractional part that was just removed is the offset to be predicted. What the network has to predict is therefore the offset plus the width and height.

In short, build_targets returns, for each prediction feature layer, the class to be predicted and the bounding-box offsets derived from the targets, together with the matched anchors and their indices.
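The split into grid cell and offset is simple enough to verify with one number pair. A minimal sketch in plain Python, where `cell_and_offset` is a hypothetical helper standing in for `gij = gxy.long()` and `gxy - gij`:

```python
def cell_and_offset(gx, gy):
    """Split a target center (in grid units) into its grid cell and the offset inside it.

    int() truncates like .long(); for non-negative coordinates this rounds down.
    """
    gi, gj = int(gx), int(gy)          # top-left corner of the responsible grid cell
    return (gi, gj), (gx - gi, gy - gj)


# Center (11.5, 5.75) on the feature map, as produced by the scaling step.
cell, offset = cell_and_offset(11.5, 5.75)
print(cell, offset)
```

Here the cell is (gi, gj) = (11, 5) and the offset is (0.5, 0.75). Note that indices stores the pair in the order (gj, gi), because the feature map is indexed row first.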

# 2. Loss calculation

The compute_loss function calculates the loss.

## 2.1 Processing flow diagram

An analysis of the process is given at the end of this section.

## 2.2 compute_loss code

The code is as follows:

```python
import torch
import torch.nn as nn

# smooth_BCE, FocalLoss and bbox_iou are helpers defined elsewhere in the project.


def compute_loss(p, targets, model):
    """
    Compute the loss over the positive samples of all anchors.
    Args:
        p: predictions, a list of the three tensors output by the three YOLO layers
           e.g. [4, 3, 23, 23, 25] [4, 3, 46, 46, 25] [4, 3, 92, 92, 25]
           [batch_size, anchor_num, grid, grid, xywh + obj + classes]
           These are the predictions of every grid cell of the three YOLO layers (three
           predictions per grid cell), so the positive samples have to be selected afterwards.
        targets: ground-truth boxes after data augmentation, e.g. [21, 6] or [22, 6]
                 21: num_object
                 6: image index within the batch (0, 1, 2, 3) + class + x + y + w + h
        model: the initialized model
    Returns:
        lbox: box (position) loss, tensor([1])
        obj_loss: confidence loss, tensor([1])
        class_loss: classification loss, tensor([1])
    """
    device = p[0].device
    lcls = torch.zeros(1, device=device)  # Tensor([0.])
    lbox = torch.zeros(1, device=device)  # Tensor([0.])
    lobj = torch.zeros(1, device=device)  # Tensor([0.])

    # Build targets for compute_loss(), input targets(image, class, x, y, w, h)
    # tbox: [m (number of positive samples), x offset + y offset + w + h] per layer. The offsets are
    #       relative to the top-left corner of the grid cell. A positive sample of an anchor is a
    #       target predicted by this anchor; one target may be predicted by several anchors, so
    #       usually m > nt.
    # indices: [m, b + a + gj + gi] per layer, aligned one to one with tbox
    #       b: which image of the batch the target belongs to
    #       a: which anchor (index) the target is a positive sample of
    #       gj: y of the top-left corner of the target's grid cell (integer part of the box center y)
    #       gi: x of the top-left corner of the target's grid cell (integer part of the box center x)
    # tcls: [m] per layer, the class of each target
    # anch: [m, 2] per layer, the shape of the anchor each target is a positive sample of
    tcls, tbox, indices, anchors = build_targets(p, targets, model)  # positive samples of all anchors
    h = model.hyp  # hyperparameters
    red = 'mean'  # loss reduction (sum or mean)

    # Loss functions: BCEcls for classification, BCEobj for confidence.
    # Without focal loss, BCEWithLogitsLoss is used: sigmoid on the output, then binary cross entropy.
    # pos_weight: weight of the positive class, used to ease sample imbalance
    # reduction: "sum" sums the per-sample losses, "mean" averages them, "none" returns them
    #            one by one with the same shape as the input
    BCEcls = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['cls_pw']], device=device), reduction=red)
    BCEobj = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([h['obj_pw']], device=device), reduction=red)

    # class label smoothing https://arxiv.org/pdf/1902.04103.pdf eqn 3
    # cp, cn: smoothed positive and negative label values
    cp, cn = smooth_BCE(eps=0)  # eps=0 means no label smoothing; a typical value is 0.1

    # focal loss is not used here for now; it is generally applied to the classification loss
    g = h['fl_gamma']  # focal loss gamma, default 0
    if g > 0:
        BCEcls, BCEobj = FocalLoss(BCEcls, g), FocalLoss(BCEobj, g)

    # Iterate over the output of each YOLO layer
    nt = 0  # targets
    # layer index (0, 1, 2), layer predictions [batch_size, anchor_num, grid, grid, xywh + obj + classes]
    for i, pi in enumerate(p):  # pi is the output of YOLO layer i, e.g. [4, 3, 23, 23, 25]
        # Positive samples (over the three anchors) of this layer; b, a, gj, gi are aligned with tbox[i]
        b, a, gj, gi = indices[i]
        tobj = torch.zeros_like(pi[..., 0], device=device)  # target obj, [4, 3, 23, 23]

        # m: number of positive samples over all anchors. This is not the number of ground-truth
        # boxes and is usually larger, because one box may be predicted by several anchors.
        nb = b.shape[0]
        if nb:
            # pi = [batch_size, anchor_num, grid, grid, xywh + obj + classes]
            # ps = [m, xywh + obj + classes]: the predictions at the positions of the positive
            # samples, gathered so that the loss is easy to compute
            ps = pi[b, a, gj, gi]  # prediction subset corresponding to targets, e.g. [30, 25]

            # lbox: position loss, GIoU loss
            pxy = ps[:, :2].sigmoid()
            pwh = ps[:, 2:4].exp().clamp(max=1E3) * anchors[i]
            pbox = torch.cat((pxy, pwh), 1)  # predicted box
            giou = bbox_iou(pbox.t(), tbox[i], x1y1x2y2=False, GIoU=True)  # giou(prediction, target)
            lbox += (1.0 - giou).mean() if red == "mean" else (1.0 - giou).sum()  # giou loss

            # obj target: model.gr is the giou loss ratio and is defined in train.py
            # model.gr=1: the confidence target where an object exists is giou; model.gr=0: it is 1
            tobj[b, a, gj, gi] = (1.0 - model.gr) + model.gr * giou.detach().clamp(0).type(tobj.dtype)

            # lcls: classification loss of all positive samples
            if model.nc > 1:  # cls loss (only if multiple classes)
                # cn: negative label value (label smoothing); start with [m, 20] filled with cn
                t = torch.full_like(ps[:, 5:], cn, device=device)  # targets
                # nb: m; tcls[i]: class index of each ground-truth box; cp: positive label value
                # The position of the true class is set to cp, all other classes stay at cn
                t[range(nb), tcls[i]] = cp
                lcls += BCEcls(ps[:, 5:], t)  # BCE class loss (focal loss if enabled)

            # Append targets to text file
            # with open('targets.txt', 'a') as file:
            #     [file.write('%11.5g ' * 4 % tuple(x) + '\n') for x in torch.cat((txy[i], twh[i]), 1)]

        # lobj: confidence loss, computed over all prediction positions, positive and negative
        # samples alike, so that the prediction is pushed toward 0 at negative samples and
        # toward 1 at positive samples
        lobj += BCEobj(pi[..., 4], tobj)  # obj loss

    # Multiply each loss by its weight.
    # The main loss in object detection comes from the position loss, so lbox gets a small weight
    # to balance the terms and keep a single loss from dominating.
    lbox *= h['giou']  # 3.54
    lobj *= h['obj']  # 102.88
    lcls *= h['cls']  # 9.35

    # loss = lbox + lobj + lcls
    return {"box_loss": lbox,
            "obj_loss": lobj,
            "class_loss": lcls}
```

## 2.3 Brief analysis

Why does build_targets also return indices? Because indices ties the targets to the output of the feature layer. The output of one prediction feature layer has the shape [batch, anchor, grid_y, grid_x, xywh+obj+cls], so with the values stored in indices, ps = pi[b, a, gj, gi] picks out the predictions that correspond to the positive samples.
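The gather performed by `pi[b, a, gj, gi]` can be imitated with nested lists. This is a plain-Python sketch for intuition only: `gather` is a hypothetical helper, the layer output is tiny, and each "prediction" is replaced by a tuple recording its own position so the result is easy to read.

```python
def gather(pi, b, a, gj, gi):
    """Pick one prediction per positive sample, like pi[b, a, gj, gi] on a tensor."""
    return [pi[bb][aa][jj][ii] for bb, aa, jj, ii in zip(b, a, gj, gi)]


# Fake layer output: 2 images, 3 anchors, 4x4 grid.
# Each cell holds (image, anchor, row, col) instead of a real 25-value prediction.
B, A, H, W = 2, 3, 4, 4
pi = [[[[(bb, aa, y, x) for x in range(W)]
        for y in range(H)]
       for aa in range(A)]
      for bb in range(B)]

# Two positive samples: (image 0, anchor 2, row 1, col 3) and (image 1, anchor 0, row 3, col 2).
b, a, gj, gi = [0, 1], [2, 0], [1, 3], [3, 2]
ps = gather(pi, b, a, gj, gi)
print(ps)
```

The four index lists are read in lockstep, so m positive samples yield exactly m prediction vectors, in the same order as tbox, tcls and anch.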

Now we know which positions of the prediction feature layer are responsible for prediction. What remains is to compute the loss from the 25-dimensional prediction at each of those positions. Since the targets built above are offsets, ps[:, :2] predicts the xy offset: pxy = ps[:, :2].sigmoid(). ps[:, 2:4] predicts the scaling relative to the matched anchor, and the exp function keeps the scaling factor greater than 0: pwh = ps[:, 2:4].exp().clamp(max=1E3) * anchors[i].
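The decoding of one raw prediction can be sketched in plain Python as follows. `decode` is a hypothetical helper, and the raw values are made up; it follows the two expressions quoted above for a single positive sample.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode(raw_xywh, anchor_wh):
    """Turn raw outputs (tx, ty, tw, th) into an xy offset and a width/height in grid units."""
    tx, ty, tw, th = raw_xywh
    px, py = sigmoid(tx), sigmoid(ty)            # offset inside the grid cell, in (0, 1)
    # exp keeps the scale factor positive. The factor is capped at 1e3 like clamp(max=1E3);
    # the inner min() only avoids a float overflow in math.exp and does not change the result.
    sw = min(math.exp(min(tw, 10.0)), 1e3)
    sh = min(math.exp(min(th, 10.0)), 1e3)
    return px, py, sw * anchor_wh[0], sh * anchor_wh[1]


# All-zero raw outputs: the box sits in the middle of the cell and has exactly the anchor's size.
print(decode((0.0, 0.0, 0.0, 0.0), (3.625, 2.8125)))
```

With zero raw outputs the sigmoid gives 0.5 and exp gives 1, so an untrained prediction starts at the cell center with the anchor's shape. This is also why a roughly suitable anchor is enough: the network only learns a correction to it.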

The confidence, the fifth value of the prediction, is computed over all prediction positions, positive and negative samples alike. The aim is to push the predicted value toward 0 at negative samples and toward 1 at positive samples. The confidence target at a positive sample is the GIoU value of the box matched at that position (with model.gr = 1).
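How the GIoU value turns into a confidence target can be sketched in plain Python. This is an illustration under assumptions, not the project's bbox_iou: `giou_xywh` and `obj_target` are hypothetical helpers, boxes are given as (center x, center y, width, height), and no epsilon terms are added for numerical safety.

```python
def giou_xywh(box1, box2):
    """GIoU of two boxes given as (center x, center y, width, height)."""
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box1, box2
    l1, r1, t1, b1 = x1 - w1 / 2, x1 + w1 / 2, y1 - h1 / 2, y1 + h1 / 2
    l2, r2, t2, b2 = x2 - w2 / 2, x2 + w2 / 2, y2 - h2 / 2, y2 + h2 / 2
    inter = max(0.0, min(r1, r2) - max(l1, l2)) * max(0.0, min(b1, b2) - max(t1, t2))
    union = w1 * h1 + w2 * h2 - inter
    iou = inter / union
    # Area of the smallest box that encloses both boxes.
    c_area = (max(r1, r2) - min(l1, l2)) * (max(b1, b2) - min(t1, t2))
    return iou - (c_area - union) / c_area


def obj_target(giou, gr=1.0):
    """Confidence target at a positive sample: (1 - gr) + gr * clamp(giou, min=0)."""
    return (1.0 - gr) + gr * max(giou, 0.0)


same = giou_xywh((0.5, 0.5, 2.0, 2.0), (0.5, 0.5, 2.0, 2.0))   # perfect prediction
apart = giou_xywh((0.0, 0.0, 1.0, 1.0), (3.0, 0.0, 1.0, 1.0))  # boxes that do not overlap
print(same, obj_target(same), apart, obj_target(apart))
```

A perfect prediction has GIoU 1 and gets confidence target 1, while a prediction that misses its box has a negative GIoU that is clamped to a target of 0. With gr = 0 the target is 1 for every positive sample regardless of box quality.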

The classification loss is relatively simple and is not described here. Implementations differ in code, but the idea is the same. Finally, each loss is multiplied by a weight. Note that the main loss here is the bounding-box regression loss, followed by the classification loss, while the confidence loss value is relatively small. The confidence loss is therefore multiplied by a relatively large weight and the bounding-box loss by a relatively small one. The weights set here are: {'giou': 3.54, 'cls': 9.35, 'cls_pw': 1.0, 'obj': 102.88, …}.
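The effect of the weights is easiest to see on numbers. In the sketch below the three weights are the ones quoted above, while `weight_losses` is a hypothetical helper and the raw loss values are invented purely to show the rebalancing; they are not measurements from a training run.

```python
def weight_losses(lbox, lobj, lcls, hyp):
    """Scale the three raw losses by their weights, as at the end of compute_loss."""
    return {"box_loss": lbox * hyp["giou"],
            "obj_loss": lobj * hyp["obj"],
            "class_loss": lcls * hyp["cls"]}


hyp = {"giou": 3.54, "obj": 102.88, "cls": 9.35}

# Invented raw values: a large box loss, a medium class loss, a small confidence loss.
weighted = weight_losses(lbox=0.9, lobj=0.02, lcls=0.3, hyp=hyp)
total = sum(weighted.values())
print(weighted, total)
```

Before weighting, the invented confidence loss is 45 times smaller than the box loss. After weighting, the three terms are of the same order of magnitude, so no single term controls the gradient.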

The three loss values multiplied by the weight are:

References:

- https://blog.csdn.net/qq_38253797/article/details/118046587 (this blogger's comments are very detailed and worth reading)
- https://www.bilibili.com/video/BV1t54y1C7ra?p=6 (or watch the video walkthrough on Bilibili)