Cute image classification - Conv network finally breathed a sigh: defeated Swin's ConvNeXt

Conv network finally breathed a sigh of relief: it blew up Swin's ConvNeXt

ConvNeXt data:

Thesis address: https://arxiv.org/abs/2201.03545

GitHub Code: https://github.com/facebookresearch/ConvNeXt

Comparison between ConvNeXt and major ranking models:

Unconventional ConvNext network:

Design of ConvNeXt network:

                · Macro design         

                · ResNeXt

                 · Inverted bottleneck

                 · Large kerner size

                 · Various layer-wise Micro designs

ConvNeXt various versions view:

ConvNeXt backbone network code view:

pytorch version:

                TensorFlow version:

ConvNeXt data:

Thesis address: https://arxiv.org/abs/2201.03545

github Code: https://github.com/facebookresearch/ConvNeXt

Comparison between ConvNeXt and major ranking models:

When Vision Transformer appeared in those years, it was said to have exploded the major conv networks and achieved the list brushing. Later, swing transformer was even better.  

We can't help thinking, is the era of Conv network really over?

Moreover, most Transformer networks are complex and difficult to read, and the amount of computation is very large. At least, my computer can't run Transformer networks, while Conv networks are much more lovely.

Now, Conv network has finally breathed a sigh of relief: Transformer network, which is known to explode all major Conv networks, has been overwhelmed by Conv network technology!

Why do I say that? Let's first look at the comparison of several groups of data:

This is the result shown in the ConvNeXt network paper. We can clearly see that in the ImageNet data set, ConvNeXt has achieved list brushing, and what is more valuable is that the ConvNeXt network structure is very simple! The whole trunk only needs about 100 lines of code!

ConvNeXt not only performs well in ImageNet data sets, but also the accuracy of COCO and other data sets is still higher than that of swing transformer under the same FLOPs

Of course, the ingenious design of Transformer network is still very impressive.

Unconventional ConvNext network:

ConvNext does not introduce any new structure, and only makes very subtle modifications on the previous structure, so as to achieve a better network than swing transformer.

Design of ConvNeXt network:

                · Macro design         

MacroDesign is divided into two parts: stage ratio and'patchlify stem '. The author has improved MacroDesign in these two aspects.  

Convolution structure:

Improvement of stage ratio:

stage ratio is mainly conv2 x : conv3. x : conv4. x : conv5. x

We can see that the ratio of ResNet50 is (3:4:6:3), and ConvNeXt changes the ratio to: (3:3:9:3) what are the benefits of doing so?

With such a change, the accuracy rate was increased from 78.8% to 79.4%.  

patchlify stem improvements:

In convolutional neural network, we call the initial down sampling module conv1 Stem

We can see that the Stem of ResNet50 is 7 × 7 and the maximum pooled down sampling with step size of 2. The author changed it to: Stem with convolution kernel size of 4 and step size of 4.

Such a change: the accuracy is improved to 79.5%.

                · ResNeXt

ResNext consists of depth conv and width.

ResNeXt achieves a better balance between FLOPs and accuracy than ordinary ResNet. Here the author adopts the more radical depthwise revolution.

                 · Inverted bottleneck

The Inverted bottleneck is mainly composed of converting dims.  

The author believes that the MLP module in Transformer block is very similar to the inverted blockleneck module in MobileNetV2, that is, the two ends are thin and the middle is thick.

After the author joined the Inverted bottleneck, the accuracy of the smaller model was increased from 80.5% to 80.6%, and the accuracy of the larger model was increased from 81.9% to 82.6%.

                 · Large kerner size

Moving up depthwise conv layer, depthwise_conv module moves up

It turned out to be 1 × 1 conv --> depthwise_ conv --> 1 × 1conv

Now it becomes depthwise_ conv --> 1 × 1 conv --> 1 × 1conv

Increasing the kernel size, changing the convolution kernel size of depthwise conv from 3 × 3 changed to 7 × seven

                 · Various layer-wise Micro designs

Improvements:

·Replacing ReLU with GELU

·Fewer activation functions (reduce the use of activation functions)

·Fewer normalization layers

·Substituting BN with LN

·Separate downsampling layers

ConvNeXt various versions view:

ConvNeXt-T:          C=(96,192,384,768),         B=(3,3,9,3)

ConvNeXt-S:         C=(96,192,384,768),         B=(3,3,27,3)

ConvNeXt-B:         C=(128,256,512,1024),     B=(3,3,27,3)

ConvNeXt-L:          C=(192,384,768,1536),       B=(3,3,27,3)

ConvNeXt-XL:       C=(256,512,1024,2048),   B=(3,3,27,3)

ConvNeXt backbone network code view:

pytorch version:

"""
original code from facebook research:
https://github.com/facebookresearch/ConvNeXt
"""

import torch
import torch.nn as nn
import torch.nn.functional as F


def drop_path(x, drop_prob: float = 0., training: bool = False):
    """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).

    This is the same as the DropConnect impl I created for EfficientNet, etc networks, however,
    the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper...
    See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for
    changing the layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use
    'survival rate' as the argument.

    """
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output


class DropPath(nn.Module):
    """Drop paths (Stochastic Depth) per sample  (when applied in main path of residual blocks).
    """
    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)


class LayerNorm(nn.Module):
    r""" LayerNorm that supports two data formats: channels_last (default) or channels_first.
    The ordering of the dimensions in the inputs. channels_last corresponds to inputs with
    shape (batch_size, height, width, channels) while channels_first corresponds to inputs
    with shape (batch_size, channels, height, width).
    """

    def __init__(self, normalized_shape, eps=1e-6, data_format="channels_last"):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(normalized_shape), requires_grad=True)
        self.bias = nn.Parameter(torch.zeros(normalized_shape), requires_grad=True)
        self.eps = eps
        self.data_format = data_format
        if self.data_format not in ["channels_last", "channels_first"]:
            raise ValueError(f"not support data format '{self.data_format}'")
        self.normalized_shape = (normalized_shape,)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.data_format == "channels_last":
            return F.layer_norm(x, self.normalized_shape, self.weight, self.bias, self.eps)
        elif self.data_format == "channels_first":
            # [batch_size, channels, height, width]
            mean = x.mean(1, keepdim=True)
            var = (x - mean).pow(2).mean(1, keepdim=True)
            x = (x - mean) / torch.sqrt(var + self.eps)
            x = self.weight[:, None, None] * x + self.bias[:, None, None]
            return x


class Block(nn.Module):
    r""" ConvNeXt Block. There are two equivalent implementations:
    (1) DwConv -> LayerNorm (channels_first) -> 1x1 Conv -> GELU -> 1x1 Conv; all in (N, C, H, W)
    (2) DwConv -> Permute to (N, H, W, C); LayerNorm (channels_last) -> Linear -> GELU -> Linear; Permute back
    We use (2) as we find it slightly faster in PyTorch

    Args:
        dim (int): Number of input channels.
        drop_rate (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_rate=0., layer_scale_init_value=1e-6):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise conv
        self.norm = LayerNorm(dim, eps=1e-6, data_format="channels_last")
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise/1x1 convs, implemented with linear layers
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)
        self.gamma = nn.Parameter(layer_scale_init_value * torch.ones((dim,)),
                                  requires_grad=True) if layer_scale_init_value > 0 else None
        self.drop_path = DropPath(drop_rate) if drop_rate > 0. else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # [N, C, H, W] -> [N, H, W, C]
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        if self.gamma is not None:
            x = self.gamma * x
        x = x.permute(0, 3, 1, 2)  # [N, H, W, C] -> [N, C, H, W]

        x = shortcut + self.drop_path(x)
        return x


class ConvNeXt(nn.Module):
    r""" ConvNeXt
        A PyTorch impl of : `A ConvNet for the 2020s`  -
          https://arxiv.org/pdf/2201.03545.pdf
    Args:
        in_chans (int): Number of input image channels. Default: 3
        num_classes (int): Number of classes for classification head. Default: 1000
        depths (tuple(int)): Number of blocks at each stage. Default: [3, 3, 9, 3]
        dims (int): Feature dimension at each stage. Default: [96, 192, 384, 768]
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
        head_init_scale (float): Init scaling value for classifier weights and biases. Default: 1.
    """
    def __init__(self, in_chans: int = 3, num_classes: int = 1000, depths: list = None,
                 dims: list = None, drop_path_rate: float = 0., layer_scale_init_value: float = 1e-6,
                 head_init_scale: float = 1.):
        super().__init__()
        self.downsample_layers = nn.ModuleList()  # stem and 3 intermediate downsampling conv layers
        stem = nn.Sequential(nn.Conv2d(in_chans, dims[0], kernel_size=4, stride=4),
                             LayerNorm(dims[0], eps=1e-6, data_format="channels_first"))
        self.downsample_layers.append(stem)

        # Corresponding to the three downsample s before stage2-stage4
        for i in range(3):
            downsample_layer = nn.Sequential(LayerNorm(dims[i], eps=1e-6, data_format="channels_first"),
                                             nn.Conv2d(dims[i], dims[i+1], kernel_size=2, stride=2))
            self.downsample_layers.append(downsample_layer)

        self.stages = nn.ModuleList()  # 4 feature resolution stages, each consisting of multiple blocks
        dp_rates = [x.item() for x in torch.linspace(0, drop_path_rate, sum(depths))]
        cur = 0
        # Build stacked block s in each stage
        for i in range(4):
            stage = nn.Sequential(
                *[Block(dim=dims[i], drop_rate=dp_rates[cur + j], layer_scale_init_value=layer_scale_init_value)
                  for j in range(depths[i])]
            )
            self.stages.append(stage)
            cur += depths[i]

        self.norm = nn.LayerNorm(dims[-1], eps=1e-6)  # final norm layer
        self.head = nn.Linear(dims[-1], num_classes)
        self.apply(self._init_weights)
        self.head.weight.data.mul_(head_init_scale)
        self.head.bias.data.mul_(head_init_scale)

    def _init_weights(self, m):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.trunc_normal_(m.weight, std=0.2)
            nn.init.constant_(m.bias, 0)

    def forward_features(self, x: torch.Tensor) -> torch.Tensor:
        for i in range(4):
            x = self.downsample_layers[i](x)
            x = self.stages[i](x)

        return self.norm(x.mean([-2, -1]))  # global average pooling, (N, C, H, W) -> (N, C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.forward_features(x)
        x = self.head(x)
        return x


def convnext_tiny(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_tiny_1k_224_ema.pth
    model = ConvNeXt(depths=[3, 3, 9, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_small(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_small_1k_224_ema.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_base(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_base_1k_224_ema.pth
    # https://dl.fbaipublicfiles.com/convnext/convnext_base_22k_224.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[128, 256, 512, 1024],
                     num_classes=num_classes)
    return model


def convnext_large(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_large_1k_224_ema.pth
    # https://dl.fbaipublicfiles.com/convnext/convnext_large_22k_224.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[192, 384, 768, 1536],
                     num_classes=num_classes)
    return model


def convnext_xlarge(num_classes: int):
    # https://dl.fbaipublicfiles.com/convnext/convnext_xlarge_22k_224.pth
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[256, 512, 1024, 2048],
                     num_classes=num_classes)
    return model

TensorFlow version:

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, initializers, Model

KERNEL_INITIALIZER = {
    "class_name": "TruncatedNormal",
    "config": {
        "stddev": 0.2
    }
}

BIAS_INITIALIZER = "Zeros"


class Block(layers.Layer):
    """
    Args:
        dim (int): Number of input channels.
        drop_rate (float): Stochastic depth rate. Default: 0.0
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, dim, drop_rate=0., layer_scale_init_value=1e-6, name: str = None):
        super().__init__(name=name)
        self.layer_scale_init_value = layer_scale_init_value
        self.dwconv = layers.DepthwiseConv2D(7,
                                             padding="same",
                                             depthwise_initializer=KERNEL_INITIALIZER,
                                             bias_initializer=BIAS_INITIALIZER,
                                             name="dwconv")
        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")
        self.pwconv1 = layers.Dense(4 * dim,
                                    kernel_initializer=KERNEL_INITIALIZER,
                                    bias_initializer=BIAS_INITIALIZER,
                                    name="pwconv1")
        self.act = layers.Activation("gelu")
        self.pwconv2 = layers.Dense(dim,
                                    kernel_initializer=KERNEL_INITIALIZER,
                                    bias_initializer=BIAS_INITIALIZER,
                                    name="pwconv2")
        self.drop_path = layers.Dropout(drop_rate, noise_shape=(None, 1, 1, 1)) if drop_rate > 0 else None

    def build(self, input_shape):
        if self.layer_scale_init_value > 0:
            self.gamma = self.add_weight(shape=[input_shape[-1]],
                                         initializer=initializers.Constant(self.layer_scale_init_value),
                                         trainable=True,
                                         dtype=tf.float32,
                                         name="gamma")
        else:
            self.gamma = None

    def call(self, x, training=False):
        shortcut = x
        x = self.dwconv(x)
        x = self.norm(x, training=training)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)

        if self.gamma is not None:
            x = self.gamma * x

        if self.drop_path is not None:
            x = self.drop_path(x, training=training)

        return shortcut + x


class Stem(layers.Layer):
    def __init__(self, dim, name: str = None):
        super().__init__(name=name)
        self.conv = layers.Conv2D(dim,
                                  kernel_size=4,
                                  strides=4,
                                  padding="same",
                                  kernel_initializer=KERNEL_INITIALIZER,
                                  bias_initializer=BIAS_INITIALIZER,
                                  name="conv2d")
        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")

    def call(self, x, training=False):
        x = self.conv(x)
        x = self.norm(x, training=training)
        return x


class DownSample(layers.Layer):
    def __init__(self, dim, name: str = None):
        super().__init__(name=name)
        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")
        self.conv = layers.Conv2D(dim,
                                  kernel_size=2,
                                  strides=2,
                                  padding="same",
                                  kernel_initializer=KERNEL_INITIALIZER,
                                  bias_initializer=BIAS_INITIALIZER,
                                  name="conv2d")

    def call(self, x, training=False):
        x = self.norm(x, training=training)
        x = self.conv(x)
        return x


class ConvNeXt(Model):
    r""" ConvNeXt
        A Tensorflow impl of : `A ConvNet for the 2020s`  -
          https://arxiv.org/pdf/2201.03545.pdf
    Args:
        num_classes (int): Number of classes for classification head. Default: 1000
        depths (tuple(int)): Number of blocks at each stage. Default: [3, 3, 9, 3]
        dims (int): Feature dimension at each stage. Default: [96, 192, 384, 768]
        drop_path_rate (float): Stochastic depth rate. Default: 0.
        layer_scale_init_value (float): Init value for Layer Scale. Default: 1e-6.
    """
    def __init__(self, num_classes: int, depths: list, dims: list, drop_path_rate: float = 0.,
                 layer_scale_init_value: float = 1e-6):
        super().__init__()
        self.stem = Stem(dims[0], name="stem")

        cur = 0
        dp_rates = np.linspace(start=0, stop=drop_path_rate, num=sum(depths))
        self.stage1 = [Block(dim=dims[0],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage1_block{i}")
                       for i in range(depths[0])]
        cur += depths[0]

        self.downsample2 = DownSample(dims[1], name="downsample2")
        self.stage2 = [Block(dim=dims[1],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage2_block{i}")
                       for i in range(depths[1])]
        cur += depths[1]

        self.downsample3 = DownSample(dims[2], name="downsample3")
        self.stage3 = [Block(dim=dims[2],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage3_block{i}")
                       for i in range(depths[2])]
        cur += depths[2]

        self.downsample4 = DownSample(dims[3], name="downsample4")
        self.stage4 = [Block(dim=dims[3],
                             drop_rate=dp_rates[cur + i],
                             layer_scale_init_value=layer_scale_init_value,
                             name=f"stage4_block{i}")
                       for i in range(depths[3])]

        self.norm = layers.LayerNormalization(epsilon=1e-6, name="norm")
        self.head = layers.Dense(units=num_classes,
                                 kernel_initializer=KERNEL_INITIALIZER,
                                 bias_initializer=BIAS_INITIALIZER,
                                 name="head")

    def call(self, x, training=False):
        x = self.stem(x, training=training)
        for block in self.stage1:
            x = block(x, training=training)

        x = self.downsample2(x, training=training)
        for block in self.stage2:
            x = block(x, training=training)

        x = self.downsample3(x, training=training)
        for block in self.stage3:
            x = block(x, training=training)

        x = self.downsample4(x, training=training)
        for block in self.stage4:
            x = block(x, training=training)

        x = tf.reduce_mean(x, axis=[1, 2])
        x = self.norm(x, training=training)
        x = self.head(x)
        return x


def convnext_tiny(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 9, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_small(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[96, 192, 384, 768],
                     num_classes=num_classes)
    return model


def convnext_base(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[128, 256, 512, 1024],
                     num_classes=num_classes)
    return model


def convnext_large(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[192, 384, 768, 1536],
                     num_classes=num_classes)
    return model


def convnext_xlarge(num_classes: int):
    model = ConvNeXt(depths=[3, 3, 27, 3],
                     dims=[256, 512, 1024, 2048],
                     num_classes=num_classes)
    return model

Tags: Python Deep Learning neural networks

Posted by paggard on Sat, 16 Jul 2022 06:20:04 +0930