[paper notes] DR-TANet: Dynamic Receptive Temporal Attention Network for Street Scene Change Detection


Paper title: DR-TANet: Dynamic Receptive Temporal Attention Network for Street Scene Change Detection

Published in: IEEE Intelligent Vehicles Symposium 2021 (IV2021)

Paper link: https://arxiv.org/abs/2103.00879v2

Project link: GitHub - Herrccc/DR-TANet


Street scene change detection (SCD) aims to identify the changed regions between a pair of street scene images taken at different times. State-of-the-art networks based on an encoder-decoder architecture use the feature maps at corresponding levels of the two temporal channels to obtain sufficient change information. However, the efficiency of feature extraction, of feature correlation calculation, and even of the whole network still needs to be improved.

This paper proposes the Temporal Attention Module (TAM) and studies how the dependency range of temporal attention influences change detection performance. Building on TAM, it introduces an efficient and lightweight Dynamic Receptive Temporal Attention Module (DRTAM), and proposes Concurrent Horizontal and Vertical Attention (CHVA) to improve the network's accuracy on specific challenges.

Excellent results are achieved on the PCD ("GSV" and "TSUNAMI") and VL-CMU-CD datasets.


For paired images, there are three common feature fusion methods: early fusion, late fusion and correlation fusion.
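The three fusion strategies can be illustrated with a minimal PyTorch sketch. The tiny encoder, the tensor sizes and the helper name `make_encoder` are hypothetical and only show where the two images are combined. The cosine-similarity form of correlation fusion is one simple choice for illustration, not necessarily what any particular network uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def make_encoder(in_ch):
    # Hypothetical stand-in encoder, only used to show where fusion happens
    return nn.Sequential(nn.Conv2d(in_ch, 8, 3, padding=1), nn.ReLU())

img_t0 = torch.randn(1, 3, 32, 32)
img_t1 = torch.randn(1, 3, 32, 32)

# Early fusion: concatenate the two images, then extract features once
early = make_encoder(6)(torch.cat([img_t0, img_t1], dim=1))

# Late fusion: extract features per image, then concatenate the feature maps
enc = make_encoder(3)
f0, f1 = enc(img_t0), enc(img_t1)
late = torch.cat([f0, f1], dim=1)

# Correlation fusion (one simple form): per-pixel cosine similarity of features
corr = F.cosine_similarity(f0, f1, dim=1).unsqueeze(1)

print(early.shape, late.shape, corr.shape)
```

In this sketch only the late and correlation variants keep separate per-image features, which is the setting the temporal attention in this paper operates on.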

Existing limitations:

  • Existing methods are restricted by their conventional, limited sampling range and struggle with objects of various sizes and shapes. The relationship between feature maps leaves room for further research, which motivates introducing attention mechanisms.
  • As SCD networks become deeper, efficiency inevitably drops. SCD networks therefore need a better balance between efficiency and performance.
  • The outlines of the change masks produced by most networks are coarse, so finer-grained change detection is needed.


  • A Temporal Attention Module (TAM) is proposed, which uses the attention mechanism to find similarity within a fixed dependency range.
  • Because strip-shaped entities (pedestrians, branches and street lamps) are difficult to detect, the paper proposes Concurrent Horizontal and Vertical Attention (CHVA) for refinement.
  • A Dynamic Receptive Temporal Attention Module (DRTAM) is proposed to make the network lightweight and suitable for intelligent vehicles.

In this paper, the network with the basic Temporal Attention Module (TAM) is called TANet;

The network with the Dynamic Receptive Temporal Attention Module (DRTAM), refined by CHVA, is called DR-TANet.


  • A new street scene change detection network, TANet, and its improved version, DR-TANet, are proposed. First, the attention mechanism is introduced into the change detection task. Based on the Temporal Attention Module (TAM), the paper explores the impact on performance of different dependency-range sizes and determines the best fixed size. In addition, because the receptive field of the feature maps grows along the encoder, the Dynamic Receptive Temporal Attention Module (DRTAM) is a natural next step. DRTAM performs well on public datasets with a low parameter count and low computational complexity. It improves the efficiency of the whole network and serves as the basis of DR-TANet.
  • Concurrent Horizontal and Vertical Attention (CHVA) is proposed and integrated into the temporal attention maps to refine the final change detection. CHVA matters most for changes of strip-shaped entities, which must be located accurately to ensure safe driving.
  • The proposed network achieves state-of-the-art performance on the street view dataset PCD ("GSV" and "TSUNAMI"). It also performs well on the "VL-CMU-CD" dataset.

Related Work

The paper briefly reviews the development of change detection methods, leading up to DR-TANet.

It also briefly discusses the correlation between features, which leads to using the attention mechanism to learn the relationship between feature maps across the temporal and spatial channels.

TANet/DR-TANet: Proposed Architecture

Overall Model Architecture 

The TANet architecture is shown in the figure. The encoder backbone is ResNet-18, which supports fast inference. The encoder path is split into two parallel branches for feature extraction. The extracted feature maps are fed into the Temporal Attention Module (TAM), which finds the similarity between the feature maps at corresponding levels of the two temporal channels. Finally, the attention maps generated by TAM are passed to the decoder. The decoder consists of four upsampling layers, which restore the resolution required for the change mask.

Temporal Attention Module (TAM)

When computing temporal attention (TA), the feature map of the t1 channel is used to generate the query matrix, while the key and value matrices are derived from the feature map of the t0 channel. The query matrix is combined with the key matrix to produce a similarity matrix between each pixel of t1 and the related pixels within the dependency range of t0. After a softmax, this matrix gives the normalized similarity between pixels; it is then applied to the value matrix to predict the changed region, as shown in Fig. 4. Intuitively, the image at t1 queries the image at t0 for the changed areas.

Computing the dependency and similarity between each pixel of temporal channel t0 and all pixels of temporal channel t1 would give complete information, but such global, whole-image attention is computationally expensive and not representative. The paper therefore applies a fixed dependency-range size: a pixel (i, j) of the feature map at channel t0 is treated as depending only on a fixed range of the feature map at channel t1, centered at the same position (i, j). The fixed dependency range can be 1 × 1, 3 × 3, 5 × 5, and so on. The effect of different dependency-range sizes is examined later in the ablation studies.
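A minimal sketch of attention restricted to a fixed dependency range, assuming a single head and no learned projections or positional encoding. The function name `local_temporal_attention` is hypothetical. It follows the query/key/value roles described for TA (query from t1, key and value from t0): every pixel attends only to the k × k window of the other feature map centered at the same position.

```python
import torch
import torch.nn.functional as F

def local_temporal_attention(fm_t0, fm_t1, k=3):
    """Each t1 pixel attends to the k x k window of t0 centered on it."""
    b, c, h, w = fm_t1.shape
    pad = (k - 1) // 2
    # Gather the k*k neighbors of every position in t0: (b, c*k*k, h*w)
    windows = F.unfold(fm_t0, kernel_size=k, padding=pad)
    windows = windows.view(b, c, k * k, h, w)
    # Dot product between the t1 query and each t0 neighbor: (b, k*k, h, w)
    scores = (fm_t1.unsqueeze(2) * windows).sum(dim=1)
    weights = F.softmax(scores, dim=1)
    # Weighted sum of the t0 neighbors (values): (b, c, h, w)
    out = (weights.unsqueeze(1) * windows).sum(dim=2)
    return out, weights

torch.manual_seed(0)
fm_t0 = torch.randn(2, 16, 8, 8)
fm_t1 = torch.randn(2, 16, 8, 8)
out, weights = local_temporal_attention(fm_t0, fm_t1, k=3)
print(out.shape, weights.shape)
```

With k = 1 the window contains a single pixel, so the softmax weight is 1 and the output is simply the t0 feature map; larger k lets each pixel compare itself against a wider neighborhood at a cost that grows with k².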

In addition, multi-head temporal attention is used: the feature map is divided into N groups along the channel dimension, and the relationship is determined within each group of channels.
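The channel grouping can be sketched as a reshape, after which attention scores are computed independently within each of the N groups. The shapes and variable names below are illustrative assumptions.

```python
import torch

torch.manual_seed(0)
batch, channels, h, w = 2, 64, 8, 8
groups = 4  # N groups along the channel dimension
assert channels % groups == 0, "channels must be divisible by groups"

q = torch.randn(batch, channels, h, w)
k = torch.randn(batch, channels, h, w)

# Split channels into groups: (batch, groups, channels_per_group, h, w)
q_g = q.view(batch, groups, channels // groups, h, w)
k_g = k.view(batch, groups, channels // groups, h, w)

# One score per group and pixel instead of a single score per pixel
scores = (q_g * k_g).sum(dim=2)
print(scores.shape)  # torch.Size([2, 4, 8, 8])
```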

Following the Transformer architecture, positional encoding is introduced so that the model captures not only the dependence and similarity between the two temporal channels but also the proximity of local positions. Specifically, pixels at different positions within the dependency range are treated differently.

The Temporal Attention Module (TAM) is built on top of TA. TAM consists of four layers. Each layer takes the feature maps of the two temporal channels as input and computes an attention map with the TA mechanism. Because the feature maps differ in size, the previously computed attention map is first downsampled and then concatenated with the next attention map. Each attention map is also fed into the decoder through a skip connection to prevent information loss along the upsampling path.
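How position information can enter the keys is sketched below, mirroring the relative positional encoding used in the core code at the end of these notes: half of the channels receive an offset that depends on the row inside the window, the other half an offset that depends on the column. The sizes are arbitrary, and the keys are set to zero here only so that the added offsets are easy to inspect.

```python
import torch

torch.manual_seed(0)
c, h, w, k = 8, 4, 4, 3
keys = torch.zeros(1, c, h, w, k, k)  # key windows: one k x k patch per pixel

# Learnable parameters in a real model; random values here for illustration
rel_h = torch.randn(c // 2, 1, 1, k, 1)  # depends on the row inside the window
rel_w = torch.randn(c // 2, 1, 1, 1, k)  # depends on the column inside the window

keys_h, keys_w = keys.split(c // 2, dim=1)
keys_pe = torch.cat((keys_h + rel_h, keys_w + rel_w), dim=1)
print(keys_pe.shape)
```

The offsets are shared by all pixel positions (h, w), so they encode where a neighbor sits relative to the window center rather than where it sits in the image.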

Dynamic Receptive Temporal Attention Module (DRTAM)

TAM uses the same fixed dependency-range size in all of its attention layers.

In vision models, along the downsampling path of the encoder, the feature maps first capture low-level entities (color, edges, etc.); later, high-level entities (texture, shape, objects, etc.) receive more attention.

Keeping the dependency range fixed while the feature maps are continuously downsampled does not make full use of the available computation.

Therefore, this paper proposes the Dynamic Receptive Temporal Attention Module (DRTAM), as shown in Figure 5. In the initial temporal attention layer, where the feature map captures low-level features, the dependency range is set to 7 × 7. In the subsequent temporal attention layers, the dependency range is 5 × 5, 3 × 3 and 1 × 1 in turn. In this way the computation is well utilized while sufficient neighborhood similarity and dependency are still collected.
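The per-layer schedule can be written down as a small table of (range size, padding) pairs. The sketch below uses the sizes and paddings that appear in the core code at the end of these notes; the feature-map sizes in `stages` are hypothetical values for a four-stage encoder. It checks that a padding of (k - 1) // 2 keeps exactly one window per pixel at every stage.

```python
import torch
import torch.nn.functional as F

# (dependency-range size, padding) per attention layer, shallow to deep
schedule = [(7, 3), (5, 2), (3, 1), (1, 0)]

# Hypothetical feature-map sizes for a 4-stage encoder (channels, height, width)
stages = [(64, 56, 56), (128, 28, 28), (256, 14, 14), (512, 7, 7)]

neighbors = []
for (k, pad), (c, h, w) in zip(schedule, stages):
    fm = torch.randn(1, c, h, w)
    windows = F.unfold(fm, kernel_size=k, padding=pad)  # (1, c*k*k, h*w)
    # padding = (k - 1) // 2 keeps one window per pixel at every stage
    assert windows.shape == (1, c * k * k, h * w)
    neighbors.append(k * k)
    print(f"range {k}x{k}: {k * k} neighbors per pixel on a {h}x{w} map")
```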

Refinement: Concurrent Horizontal and Vertical Attention (CHVA)

A convolution in a CNN usually defines its relevant field as a square centered on the central pixel.

Reconsidering the shape of the dependency range in TA, the paper designs Concurrent Horizontal and Vertical Attention (CHVA) to gather more information along these two directions and refine the detected changes of strip-shaped entities. The horizontal and vertical lengths are both set to (2*K+1), where K is the size of the temporal attention dependency range.
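A minimal sketch of attention along a vertical and a horizontal strip of length 2*K+1, assuming a single head and no learned projections. The function name `strip_attention` is hypothetical, and summing the two strip outputs is a simplification made here for illustration.

```python
import torch
import torch.nn.functional as F

def strip_attention(fm_t0, fm_t1, K=3):
    """Attend along a vertical and a horizontal strip of length 2*K + 1."""
    length = 2 * K + 1
    # Pad only along height (column strip) or only along width (row strip)
    col = F.pad(fm_t0, [0, 0, K, K]).unfold(2, length, 1)  # (b, c, h, w, length)
    row = F.pad(fm_t0, [K, K, 0, 0]).unfold(3, length, 1)  # (b, c, h, w, length)
    q = fm_t1.unsqueeze(-1)                                 # (b, c, h, w, 1)
    out = torch.zeros_like(fm_t1)
    for strip in (col, row):
        # Similarity of the query with every strip element: (b, h, w, length)
        weights = F.softmax((q * strip).sum(dim=1), dim=-1)
        out = out + (weights.unsqueeze(1) * strip).sum(dim=-1)
    return out

torch.manual_seed(0)
fm_t0 = torch.randn(2, 16, 8, 8)
fm_t1 = torch.randn(2, 16, 8, 8)
refined = strip_attention(fm_t0, fm_t1, K=3)
print(refined.shape)
```

Compared with a square window of the same reach, the two strips look 2*K+1 pixels far in each direction while only visiting 2*(2*K+1) positions, which suits thin vertical or horizontal objects.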




Datasets: PCD ('GSV' and 'TSUNAMI'), VL-CMU-CD

Data preprocessing

Experimental details

Evaluation metrics: precision, recall, F1 score
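For reference, the three metrics can be computed from a binary change mask as in the sketch below, using the standard definitions with "changed" as the positive class. The function name and the toy labels are illustrative only.

```python
def precision_recall_f1(pred, target):
    """pred and target are flat sequences of 0/1 labels (1 = changed)."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 1, 1, 0]
p, r, f1 = precision_recall_f1(pred, target)
print(p, r, f1)
```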

Evaluation on 'PCD' Dataset

On the PCD dataset, DR-TANet achieves state-of-the-art results.


DR-TANet captures finer and smaller changes.

Visualization of temporal attention (TA) maps (overlaid on the ground truth). (a)(b) show the 27th TA map of the first TA layer; (c)(d) show the 60th TA map of the first TA layer. The first TA layer generates 64 attention maps. In the 27th TA map, the highlighted areas are the areas attended to. In the 60th TA map, by contrast, the darker areas receive more attention.

Ablation Studies

Effect of the TA dependency-scope size in TAM

For each pixel in the t0-channel feature map, the dependency range defines the set of pixels centered on the same position in the t1-channel feature map. The larger the range, the more dependencies are inferred and the more detailed the information it contains.

As the size of the dependency range increases, the F1 value increases.

Effect of Concurrent Horizontal and Vertical Attention (CHVA)

With the introduction of CHVA, the F1 value increases slightly. The motivation for CHVA is to include a wider range of information in the horizontal and vertical directions. The predictions for specific image pairs improve, but the effect on the F1 value over the whole dataset is small.

It can be seen from the dotted box in the figure below that the estimation of strip entities using CHVA has been significantly improved.

Effect of the Dynamic Receptive Temporal Attention Module (DRTAM)

DRTAM brings little improvement in accuracy; its benefit lies mainly in reducing the complexity of the model.

Parameters and MACs Comparison between Different Configurations and Different Networks

First, the parameters and MACs of TA under different configurations are compared to give efficiency-oriented recommendations. Then, the parameters and MACs of several well-performing networks are compared to show that DR-TANet also has an advantage from this point of view.

Note that the attention mechanism, the dependency-range size and the use of CHVA have no effect on the number of network parameters. DRTAM+CHVA is recommended as the most efficient configuration. It achieves almost the same result as TAM with a dependency-range size of 7 × 7, while its computational cost is even lower than that of TAM with a range size of 3 × 3.

The computational complexity of DR-TANet is also compared with that of the three other best-performing networks. DR-TANet is markedly more efficient in both parameters and MACs.

Evaluation on 'VL-CMU-CD' Dataset

  • DR-TANet performs well on VL-CMU-CD.
  • DR-TANet without CHVA refinement performs even better than DR-TANet with CHVA. Part of the reason is that the changes in the "VL-CMU-CD" dataset are less fine-grained, so integrating horizontal and vertical strip information with CHVA makes no significant difference.



  • This paper introduces the attention mechanism into the street scene change detection (SCD) task. Temporal attention (TA) is introduced to exploit the similarity and dependence between the feature maps of the two temporal channels.
  • Based on TAM, a Dynamic Receptive Temporal Attention Module (DRTAM) is proposed, which maintains high performance while reducing computational complexity.
  • Concurrent Horizontal and Vertical Attention (CHVA) is introduced, which refines the prediction of changes in strip-shaped entities.
  • Experiments on the "VL-CMU-CD" and "PCD" datasets show that DR-TANet has superior performance. Considering the parameters and computational requirements of the whole network, it achieves a good balance between accuracy and efficiency.


Core code


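import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.nn.init as init

# Note: get_encoder, get_attentionmodule, get_decoder, upsample and conv3x3
# are used below but are not shown in this excerpt.


class Temporal_Attention(nn.Module):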
    def __init__(self, in_channels, out_channels, kernel_size=1, stride=1, padding=0,
                 groups=1, bias=False, refinement=False):
        super(Temporal_Attention, self).__init__()
        self.outc = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.groups = groups
        self.refinement = refinement

        print('Attention Layer-kernel size:{0},stride:{1},padding:{2},groups:{3}...'.format(self.kernel_size,self.stride,self.padding,self.groups))
        if self.refinement:
            print("Attention with refinement...")

        assert self.outc % self.groups == 0, 'out_channels should be divided by groups.'

        self.w_q = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias)
        self.w_k = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias)
        self.w_v = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=bias)

        #relative positional encoding...
        self.rel_h = nn.Parameter(torch.randn(self.outc // 2, 1, 1, self.kernel_size, 1), requires_grad = True)
        self.rel_w = nn.Parameter(torch.randn(self.outc // 2, 1, 1, 1, self.kernel_size), requires_grad = True)
        init.normal_(self.rel_h, 0, 1)
        init.normal_(self.rel_w, 0, 1)

        init.kaiming_normal_(self.w_q.weight, mode='fan_out', nonlinearity='relu')
        init.kaiming_normal_(self.w_k.weight, mode='fan_out', nonlinearity='relu')
        init.kaiming_normal_(self.w_v.weight, mode='fan_out', nonlinearity='relu')

    def forward(self, feature_map):

        # feature_map holds the t0 and t1 features concatenated along channels
        fm_t0, fm_t1 = torch.split(feature_map, feature_map.size()[1]//2, 1)
        assert fm_t0.size() == fm_t1.size(), 'The size of feature maps of image t0 and t1 should be same.'

        batch, _, h, w = fm_t0.size()

        # query comes from t1; key and value come from the (padded) t0 features
        padded_fm_t0 = F.pad(fm_t0, [self.padding, self.padding, self.padding, self.padding])
        q_out = self.w_q(fm_t1)
        k_out = self.w_k(padded_fm_t0)
        v_out = self.w_v(padded_fm_t0)

        if self.refinement:

            padding = self.kernel_size
            padded_fm_col = F.pad(fm_t0, [0, 0, padding, padding])
            padded_fm_row = F.pad(fm_t0, [padding, padding, 0, 0])
            k_out_col = self.w_k(padded_fm_col)
            k_out_row = self.w_k(padded_fm_row)
            v_out_col = self.w_v(padded_fm_col)
            v_out_row = self.w_v(padded_fm_row)

            k_out_col = k_out_col.unfold(2, self.kernel_size * 2 + 1, self.stride)
            k_out_row = k_out_row.unfold(3, self.kernel_size * 2 + 1, self.stride)
            v_out_col = v_out_col.unfold(2, self.kernel_size * 2 + 1, self.stride)
            v_out_row = v_out_row.unfold(3, self.kernel_size * 2 + 1, self.stride)

        q_out_base = q_out.view(batch, self.groups, self.outc // self.groups, h, w, 1).repeat(1, 1, 1, 1, 1, self.kernel_size*self.kernel_size)
        q_out_ref = q_out.view(batch, self.groups, self.outc // self.groups, h, w, 1).repeat(1, 1, 1, 1, 1, self.kernel_size * 2 + 1)
        k_out = k_out.unfold(2, self.kernel_size, self.stride).unfold(3, self.kernel_size, self.stride)

        k_out_h, k_out_w = k_out.split(self.outc // 2, dim=1)
        k_out = torch.cat((k_out_h + self.rel_h, k_out_w + self.rel_w), dim=1)

        k_out = k_out.contiguous().view(batch, self.groups, self.outc // self.groups, h, w, -1)

        v_out = v_out.unfold(2, self.kernel_size, self.stride).unfold(3, self.kernel_size, self.stride)
        v_out = v_out.contiguous().view(batch, self.groups, self.outc // self.groups, h, w, -1)

        inter_out = (q_out_base * k_out).sum(dim=2)

        out = F.softmax(inter_out, dim=-1)
        out = torch.einsum('bnhwk,bnchwk -> bnchw', out, v_out).contiguous().view(batch, -1, h, w)

        if self.refinement:

            k_out_row = k_out_row.contiguous().view(batch, self.groups, self.outc // self.groups, h, w, -1)
            k_out_col = k_out_col.contiguous().view(batch, self.groups, self.outc // self.groups, h, w, -1)
            v_out_row = v_out_row.contiguous().view(batch, self.groups, self.outc // self.groups, h, w, -1)
            v_out_col = v_out_col.contiguous().view(batch, self.groups, self.outc // self.groups, h, w, -1)

            out_row = F.softmax((q_out_ref * k_out_row).sum(dim=2),dim=-1)
            out_col = F.softmax((q_out_ref * k_out_col).sum(dim=2),dim=-1)
            out += torch.einsum('bnhwk,bnchwk -> bnchw', out_row, v_out_row).contiguous().view(batch, -1, h, w)
            out += torch.einsum('bnhwk,bnchwk -> bnchw', out_col, v_out_col).contiguous().view(batch, -1, h, w)

        return out


class TANet(nn.Module):

    def __init__(self, encoder_arch, local_kernel_size, stride, padding, groups, drtam, refinement):
        super(TANet, self).__init__()

        self.encoder1, channels = get_encoder(encoder_arch,pretrained=True)
        self.encoder2, _ = get_encoder(encoder_arch,pretrained=True)
        self.attention_module = get_attentionmodule(local_kernel_size, stride, padding, groups, drtam, refinement, channels)
        self.decoder = get_decoder(channels=channels)
        self.classifier = nn.Conv2d(channels[0], 2, 1, padding=0, stride=1)
        self.bn = nn.BatchNorm2d(channels[0])
        self.relu = nn.ReLU(inplace=True)

    def forward(self, img):

        img_t0,img_t1 = torch.split(img,3,1)
        features_t0 = self.encoder1(img_t0)
        features_t1 = self.encoder2(img_t1)
        # each encoder returns a list of feature maps; '+' concatenates the two lists
        features = features_t0 + features_t1
        features_map = self.attention_module(features)
        pred_ = self.decoder(features_map)
        pred_ = upsample(pred_,[pred_.size()[2]*2, pred_.size()[3]*2])
        pred_ = self.bn(pred_)
        pred_ = upsample(pred_,[pred_.size()[2]*2, pred_.size()[3]*2])
        pred_ = self.relu(pred_)
        pred = self.classifier(pred_)

        return pred


class AttentionModule(nn.Module):

    def __init__(self, local_kernel_size = 1, stride = 1, padding = 0, groups = 1,
                 drtam = False, refinement = False, channels = [64,128,256,512]):
        super(AttentionModule, self).__init__()

        if not drtam:
            # TAM: the same dependency-range size in every layer
            self.attention_layer1 = Temporal_Attention(channels[0], channels[0], local_kernel_size, stride, padding, groups, refinement=refinement)
            self.attention_layer2 = Temporal_Attention(channels[1], channels[1], local_kernel_size, stride, padding, groups, refinement=refinement)
            self.attention_layer3 = Temporal_Attention(channels[2], channels[2], local_kernel_size, stride, padding, groups, refinement=refinement)
            self.attention_layer4 = Temporal_Attention(channels[3], channels[3], local_kernel_size, stride, padding, groups, refinement=refinement)
        else:
            # DRTAM: the dependency range shrinks from 7x7 to 1x1 with depth
            self.attention_layer1 = Temporal_Attention(channels[0], channels[0], 7, 1, 3, groups, refinement=refinement)
            self.attention_layer2 = Temporal_Attention(channels[1], channels[1], 5, 1, 2, groups, refinement=refinement)
            self.attention_layer3 = Temporal_Attention(channels[2], channels[2], 3, 1, 1, groups, refinement=refinement)
            self.attention_layer4 = Temporal_Attention(channels[3], channels[3], 1, 1, 0, groups, refinement=refinement)

        self.downsample1 = conv3x3(channels[0], channels[1], stride=2)
        self.downsample2 = conv3x3(channels[1]*2, channels[2], stride=2)
        self.downsample3 = conv3x3(channels[2]*2, channels[3], stride=2)

    def forward(self, features):

        features_t0, features_t1 = features[:4], features[4:]

        fm1 = torch.cat([features_t0[0],features_t1[0]], 1)
        attention1 = self.attention_layer1(fm1)
        fm2 = torch.cat([features_t0[1], features_t1[1]], 1)
        attention2 = self.attention_layer2(fm2)
        fm3 = torch.cat([features_t0[2], features_t1[2]], 1)
        attention3 = self.attention_layer3(fm3)
        fm4 = torch.cat([features_t0[3], features_t1[3]], 1)
        attention4 = self.attention_layer4(fm4)

        downsampled_attention1 = self.downsample1(attention1)
        cat_attention2 = torch.cat([downsampled_attention1,attention2], 1)
        downsampled_attention2 = self.downsample2(cat_attention2)
        cat_attention3 = torch.cat([downsampled_attention2,attention3], 1)
        downsampled_attention3 = self.downsample3(cat_attention3)
        final_attention_map = torch.cat([downsampled_attention3,attention4], 1)
        features_map = [final_attention_map,attention4,attention3,attention2,attention1]
        return features_map

Tags: Python Computer Vision

Posted by fydwell on Sun, 17 Apr 2022 00:36:58 +0930