Ever since their invention in 2017, transformers have been “the next big thing” in natural language processing (NLP). In recent years, vision transformers have joined their NLP counterparts and are now regularly used to solve computer vision tasks. But does that mean that classic convolutional neural networks (CNNs) are dead? Not yet. The authors of ConvNeXt did their best to revive ResNet, a popular CNN, by tweaking it to perform better and to be more similar to a transformer, without actually turning it into a transformer or adding attention into the mix. The vision transformer used as a reference throughout the paper is the Swin Transformer.

What was actually done?

All the following numbers for the improvement in accuracy and the increase in GFLOPS refer to a ResNet-50 trained on ImageNet-1K.
For comparison, the basic ResNet-50 has an accuracy of 76.1% at 4.1 GFLOPS. The Swin-T, the “Tiny” variant of the Swin Transformer, which is similar to the ResNet-50 in model size and compute, has an accuracy of 81.3% at 4.5 GFLOPS.

Training

The most impressive improvement from ResNet to ConvNeXt happens before the architecture is modified in any way. Changing the training recipe alone gains an astonishing 2.7% in accuracy. The adjustments that were made to the training regimen are the following:

  • Mixup: a training technique where the network is trained on a convex combination of pairs of examples and their labels (a small code sketch follows below this list) [Zhang et al.]

  • Cutmix: a patch is cut from one image and pasted onto another image. The label of this new image is also adjusted from the respective image labels in proportion to the image and patch. [Yun et al.]

  • RandAugment: an automatic image augmentation that can be used out of the box. It also works with a reduced search space, meaning that it performs well on large data sets. [Cubuk et al.]

  • Random Erasing: erasing random patches of an image. [Zhong et al.]

  • Stochastic Depth: a training technique which shortens the network by randomly dropping layers during training. This allows better information and gradient flow. [Huang et al.]

  • Label Smoothing: softens the hard one-hot targets, regularising the classifier by estimating the marginalised effect of label dropout during training. [Szegedy et al.]
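
To make one of these techniques concrete, here is a minimal sketch of Mixup in PyTorch. This is our own illustration, not code from the paper; the function name, the Beta parameter and the assumption of one-hot targets are ours:

```python
import torch

def mixup(images, targets, alpha=0.2):
    """Mix a batch with a shuffled copy of itself (Zhang et al.).

    images:  (B, C, H, W) tensor
    targets: (B, num_classes) one-hot or soft label tensor
    alpha:   Beta distribution parameter (0.2 is a common choice)
    """
    # Sample the mixing coefficient lambda from a Beta(alpha, alpha) distribution.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    # Pair every example with a randomly chosen partner from the same batch.
    perm = torch.randperm(images.size(0))
    # Take the same convex combination of the images and of their labels.
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_targets = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_images, mixed_targets
```

The loss is then computed on the mixed images against the mixed labels, exactly as in normal training; Cutmix works analogously, except that a rectangular patch is swapped instead of blending whole images.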

Architecture

Figure 1: ConvNeXt architecture (skip connections are not portrayed)

After such a great increase in accuracy just through the change in training, you might have high expectations for what the change in architecture might do. Dial it down. Eleven distinct adjustments to the ResNet architecture increase the accuracy by a further 3.2% in total. Now let’s take a look at those adjustments.

1. Change the compute ratio of the ResNet Blocks used per Stage. ResNet-50 has four stages, where stages one and four have three blocks, stage two has four blocks and stage three has six blocks. Swin-T also consists of four stages but has Swin Transformer Blocks in each stage. Swin-T's stages one, two and four have two blocks, while stage three has six blocks. After adjusting the compute ratio of blocks per stage of the ResNet architecture to be more similar to that of the Swin Transformer, ConvNeXt has three blocks in stages one, two and four, with nine blocks in stage three. Accuracy +0.6%, GFLOPS +0.4
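
Written out as block counts per stage, the change looks like this (just the numbers from above, arranged as Python tuples for readability):

```python
# Number of blocks in stages 1-4
resnet50_blocks   = (3, 4, 6, 3)   # original ResNet-50
swin_t_blocks     = (2, 2, 6, 2)   # Swin-T, the reference
convnext_t_blocks = (3, 3, 9, 3)   # ConvNeXt after adjusting the compute ratio
```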

2. Change the stem to “Patchify”. The ResNet stem is a 7x7 convolutional layer with a stride of two, followed by a max pooling layer. The Swin Transformer instead splits the image into non-overlapping patches of size four. Therefore, the ConvNeXt stem is now a 4x4 convolutional layer with a stride of four, so that the patches don’t overlap. Accuracy +0.1%, GFLOPS -0.1
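
In PyTorch terms, the two stems could be sketched roughly as follows. The channel counts (64 for ResNet-50, 96 for a ConvNeXt-T-sized model) and the omission of the normalisation and activation layers are simplifications on our part:

```python
import torch.nn as nn

# ResNet-style stem: 7x7 conv with stride 2, then a stride-2 max pool
# (overlapping receptive fields, 4x total downsampling).
resnet_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)

# "Patchify" stem: a single 4x4 conv with stride 4 cuts the image into
# non-overlapping 4x4 patches, mirroring Swin's patch embedding.
convnext_stem = nn.Conv2d(3, 96, kernel_size=4, stride=4)
```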

The next changes are made on a smaller level, the ResNet blocks:

Figure 2: Change from ResNet block to ConvNeXt block

3. Change the 3x3 convolutional layer to a depthwise convolution. Depthwise convolution is, in essence, grouped convolution where the number of groups equals the number of channels, so each channel is filtered independently and information is only mixed in the spatial dimension. This is similar to the weighted sum operation in self-attention. Accuracy +1.0%, GFLOPS +0.9
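
In code, the switch is just the `groups` argument of a standard convolution; a rough sketch, with the channel count assumed:

```python
import torch.nn as nn

dim = 96  # channels in the first stage (assumed, ConvNeXt-T sized)

# Regular 3x3 convolution: every output channel mixes all input channels.
regular_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)

# Depthwise 3x3 convolution: groups == channels, so each channel is
# filtered on its own -- spatial mixing only, no channel mixing.
depthwise_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
```

Channel mixing is then left entirely to the surrounding 1x1 convolutions, which mirrors how a transformer block separates spatial mixing (attention) from channel mixing (MLP).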

4. Invert the bottleneck. Change the widths of the convolutional layers in a block so that the layer in the centre is the widest. Accuracy +0.1%, GFLOPS -0.7

5. Move the depthwise convolutional layer up so that it is now the first layer in the block. Accuracy -0.1%, GFLOPS -0.5

6. Increase the kernel size of the first layer in the block to a (depthwise) 7x7 kernel. Accuracy +0.7%, GFLOPS +0.1
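
Taken together, steps 4 to 6 reshape the core of the block roughly as sketched below. The width of 96 and the expansion factor of 4 are assumed (ConvNeXt-T sized); normalisation and activation layers are left out here because they are the subject of the next steps:

```python
import torch.nn as nn

dim = 96  # block width in the first stage (assumed)

# Inverted bottleneck (narrow -> wide -> narrow) with the depthwise
# convolution moved to the front and enlarged to 7x7.
block_core = nn.Sequential(
    nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim),  # depthwise 7x7, now first (steps 5 and 6)
    nn.Conv2d(dim, 4 * dim, kernel_size=1),                     # 1x1 expansion (step 4)
    nn.Conv2d(4 * dim, dim, kernel_size=1),                     # 1x1 projection back down
)
```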

Figure 3: More changes from ResNet Block to ConvNeXt Block

7. Replace all the Rectified Linear Units (ReLU) in the activation layers with Gaussian Error Linear Units (GELU), as they are smoother and are used by the Swin Transformer. Accuracy ±0%

8. Reduce the number of activation functions and keep only the one after the expanding layer of the inverted bottleneck, i.e. between the two 1x1 layers. Accuracy +0.7%

9. Reduce the number of normalisation layers and keep only the one before the expanding 1x1 layer of the inverted bottleneck. Accuracy +0.1%

10. Replace Batch Normalisation (BN) with Layer Normalisation (LN). Accuracy +0.1%

11. Add separate downsampling layers between the stages, each consisting of a layer normalisation followed by a 2x2 convolutional layer with a stride of 2. Also, add layer normalisations before the first stage and after the global average pooling that follows the fourth stage. Accuracy +0.5%, GFLOPS +0.3
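
Putting steps 3 to 10 together, a single block could be sketched roughly as below, with the separate downsampling module from step 11 alongside it. This is a simplified sketch in PyTorch, not the authors’ reference implementation: stochastic depth and other implementation details are omitted, and the channel widths are assumed:

```python
import torch
import torch.nn as nn

class ConvNeXtBlockSketch(nn.Module):
    """Simplified ConvNeXt block:
    depthwise 7x7 -> LayerNorm -> 1x1 expand -> GELU -> 1x1 project -> skip connection."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # steps 3, 5, 6
        self.norm = nn.LayerNorm(dim)            # the single LN in the block (steps 9, 10)
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # 1x1 expansion, written as Linear over channels (step 4)
        self.act = nn.GELU()                     # the single activation in the block (steps 7, 8)
        self.pwconv2 = nn.Linear(4 * dim, dim)   # 1x1 projection back down

    def forward(self, x):                        # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to (B, H, W, C) so LayerNorm/Linear act on channels
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (B, C, H, W)
        return residual + x                      # skip connection (not shown in Figure 1)

class DownsampleSketch(nn.Module):
    """Step 11: layer normalisation followed by a non-overlapping 2x2 strided convolution."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_dim)
        self.conv = nn.Conv2d(in_dim, out_dim, kernel_size=2, stride=2)

    def forward(self, x):                        # x: (B, C, H, W)
        x = x.permute(0, 2, 3, 1)                # normalise over the channel dimension
        x = self.norm(x)
        x = x.permute(0, 3, 1, 2)
        return self.conv(x)                      # halves the resolution, changes the width
```

As a quick sanity check, `ConvNeXtBlockSketch(96)(torch.randn(1, 96, 56, 56))` returns a tensor of the same shape, since a block keeps resolution and width fixed and only the downsampling layers between stages change them.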

Discussion

This paper was presented and discussed in our AI Reading Group. We discussed whether the gain justified the effort; after all, when it comes down to the numbers, ConvNeXt barely outperforms Swin-T, even though many modifications were made. However, some excellent points were made about the approach taken in the paper. Adjusting network architectures in an intelligent and rigorous way, by copying parts and training approaches from other well-performing AI models and testing them systematically, is something that should be done more frequently! We also did a small deep dive on Twitter to find out how ConvNeXt performs in comparison to Swin-T and other vision models on other benchmarks. After seeing how well ConvNeXt performs with transfer learning and how simple the code of a ConvNeXt block is in comparison to a Swin Transformer block, we were mostly convinced that CNNs are not quite dead yet and that attention is, in fact, not all you need.

If you would like to learn more about AI basics, trends, ethics and everything in between, consider joining the AI Reading Group in the BCP. We meet once a month for one hour. A volunteer presents a paper, which we all voted on beforehand, for 10-20 minutes, and then we discuss the shortcomings and potential of the paper, related work and anything AI. You can join the AI Reading Group on BCP here.