Lecture 2: Convolutional neural networks¶

Previous lecture: Basic concepts¶

  • General discussion: how deep learning is different from classical machine learning
  • Supervised/unsupervised learning: abstract formulation
  • Fully connected neural networks: What it is and why depth matters
  • The concept of backpropagation and 'cheap gradients': why it is important to compute the gradient in a fast way (and how)
  • Convolutional neural networks: Brief definition of the CNN (to be followed tomorrow)
  • Popular deep learning libraries: TensorFlow, PyTorch and JAX.

Today's lecture¶

Convolutional neural networks in more detail:

  • Motivation for using convolutions (and the connection to classical image processing)
  • Basic building blocks of a CNN
  • Overview of the main architectures (LeNet, AlexNet, VGG, ResNet, Inception, EfficientNet, MobileNet)

Why convolutions for images?¶

  • Images exhibit locality: nearby pixels are strongly correlated.
  • Many classical image transformations can be written as convolutions.

Some transforms¶

  • How to sharpen the image?
  • Edge detect?
  • Strong edge detect?
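All of these can be written as small fixed convolution kernels. A minimal PyTorch sketch (the image tensor here is a random stand-in; the kernels are the classical sharpening and edge-detection filters):

import torch
import torch.nn.functional as F

# A toy grayscale "image" (batch=1, channels=1, 8x8); replace with a real image tensor
img = torch.rand(1, 1, 8, 8)

# Classical 3x3 kernels written as convolutions
sharpen = torch.tensor([[ 0., -1.,  0.],
                        [-1.,  5., -1.],
                        [ 0., -1.,  0.]]).reshape(1, 1, 3, 3)
edge = torch.tensor([[-1., -1., -1.],
                     [-1.,  8., -1.],
                     [-1., -1., -1.]]).reshape(1, 1, 3, 3)

print(F.conv2d(img, sharpen, padding=1).shape)  # sharpened image, same spatial size
print(F.conv2d(img, edge, padding=1).shape)     # edge map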

Pretrained filters of AlexNet¶

The first layer of the AlexNet model gives the following filters.

They are nice and smooth, which indicates the network has trained!

Note that there are other visualization tools for the inner parameters of artificial neural networks!

Cross correlation¶

Strictly speaking, what is computed in deep learning layers is not a convolution but a cross-correlation: the kernel is not flipped before it is slid over the input.
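A minimal sketch of the difference (PyTorch's F.conv2d slides the kernel without flipping it; flipping the kernel along both spatial dimensions recovers the mathematical convolution):

import torch
import torch.nn.functional as F

x = torch.arange(25, dtype=torch.float32).reshape(1, 1, 5, 5)
k = torch.tensor([[1., 2.], [3., 4.]]).reshape(1, 1, 2, 2)

cross_corr = F.conv2d(x, k)                     # what deep learning layers actually compute
true_conv = F.conv2d(x, torch.flip(k, [2, 3]))  # flip the kernel spatially -> mathematical convolution
print(torch.allclose(cross_corr, true_conv))    # False for a non-symmetric kernel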

Padding and strides¶

Convolutions have problems at the boundary: if we apply them many times, the width and height of the image keep shrinking.

A typical solution is to add (pad) pixels around the border.

We can also shift the window not by 1 row/column, but by a larger stride. This downsamples the output.
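For input size $n$, kernel size $k$, padding $p$ and stride $s$, the output size is $\lfloor (n + 2p - k)/s \rfloor + 1$. A minimal check in PyTorch (the sizes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)

same = nn.Conv2d(3, 8, kernel_size=3, padding=1)            # padding=1 keeps the 32x32 resolution
down = nn.Conv2d(3, 8, kernel_size=3, padding=1, stride=2)  # stride=2 halves the resolution

print(same(x).shape)  # torch.Size([1, 8, 32, 32])
print(down(x).shape)  # torch.Size([1, 8, 16, 16])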

Strides¶

LeNet¶

The first convolutional neural network architecture was LeNet (tested on the MNIST dataset).

What is average pooling?

Average pooling¶

  • Pooling is a common operation in CNNs used to downsample feature maps
  • Average pooling takes the average value of each sub-region of the input feature map
  • It reduces the size of the feature map while introducing a form of spatial invariance
  • Average pooling helps to reduce overfitting and improve generalization
  • However, it may lose information about the precise location of features
  • Alternatives to average pooling include max pooling, adaptive pooling
In [26]:
# Demo of adaptive average pooling: the output spatial size is fixed,
# whatever the input spatial size is
import torch
import torch.nn as nn

# target output size of 5x7
m = nn.AdaptiveAvgPool2d((5, 7))
input = torch.randn(1, 64, 8, 9)
output = m(input)
print(output.shape)
# target output size of 7x7 (square)
m = nn.AdaptiveAvgPool2d(7)
input = torch.randn(1, 64, 10, 9)
output = m(input)
# target output size of 10x7 (None keeps the input size along that dimension)
m = nn.AdaptiveAvgPool2d((None, 7))
input = torch.randn(1, 64, 10, 9)
output = m(input)
torch.Size([1, 64, 5, 7])

Performance of LeNet¶

LeNet was tested on the MNIST dataset (handwritten digits, 60,000 training samples).

This is an easy task.

Later, more complicated datasets have been introduced (CIFAR-10, CIFAR-100, Pascal VOC).

But they were still small and low-resolution; at that time, deep learning did not seem very promising for them.

ImageNet dataset¶

In 2006, Fei-Fei Li proposed the creation of the ImageNet dataset.

'While most people pay attention to models, let's pay attention to data.'

In July 2008, ImageNet had zero images. By December, it had categorized three million images across 6,000+ synsets. In April 2010, there were more than 11 million images in 15,000+ synsets. Such results would have been inconceivable for a handful of researchers. They were made possible through crowdsourcing on Amazon’s Mechanical Turk platform.

In 2010, the first ever ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was organized. Software programs competed to correctly classify and detect objects and scenes.

Current ImageNet records (Papers with Code)

AlexNet¶

AlexNet is the convolutional neural network that won the ImageNet challenge in 2012.

The overall designs of AlexNet and LeNet are very similar (LeNet on the left, AlexNet on the right). The sigmoid activations are replaced by simple ReLU activations.

What are the problems with this architecture?

AlexNet: discussion¶

Two problems: large convolutional filters at the beginning, and two large fully-connected (MLP) layers at the end ($6400 \times 4096$ and $4096 \times 4096$).

The total number of trainable parameters is around $60$M.
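A quick back-of-the-envelope check of where these parameters live, using the layer sizes quoted above:

fc1 = 6400 * 4096  # first large fully-connected layer
fc2 = 4096 * 4096  # second large fully-connected layer
print(fc1 + fc2)   # about 43M weights: the two dense layers dominate the ~60M total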

VGG¶

In 2014, Simonyan and Zisserman (Visual Geometry Group, Oxford) proposed the VGG network.

Key idea: use multiple $3 \times 3$ convolutions between MaxPooling downsampling.

The receptive field of a $5 \times 5$ convolution is similar to the receptive field of two stacked $3 \times 3$ convolutions, but the latter has fewer parameters ($5^2 = 25$ vs $2 \times 3^2 = 18$ per input/output channel pair).

They showed that deep and narrow convolutions outperform wider counterparts.

$3 \times 3$ convolutions became the de-facto standard.

VGG vs AlexNet¶

The convolutions are grouped into blocks that do not change the spatial dimension, each followed by a dimension-reduction (pooling) step.

The original VGG had 5 blocks (the first two contain one convolutional layer each, the remaining three contain two each).

Altogether, 8 convolutional layers and 3 fully-connected layers, hence the name VGG-11.

VGG networks trained on ImageNet are excellent feature extractors.

VGG perceptual loss¶

The pretrained VGG-19 network has very interesting features: if for two images these features are close, the images are perceptually similar.

$$\mathcal{L}_{VGG,(i, j)}(\hat{y},y) = \left\|F_{ij}(\hat{y}) - F_{ij}(y)\right\|_2^2$$

Here $\hat{y}$ is the reconstructed image, $y$ is the reference image, and $F_{ij}(\hat{y})$, $F_{ij}(y)$ are the feature maps obtained by the $j$-th convolution (after activation) before the $i$-th maxpooling layer within the VGG-19 network; the norm is taken over all spatial positions and channels of the feature map.
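A minimal sketch of such a perceptual loss, assuming torchvision is available; the cut point features[:16] and the plain MSE reduction are illustrative choices, not the only possible definition:

import torch
import torch.nn.functional as F
from torchvision import models

# Frozen feature extractor: the first layers of a pretrained VGG-19
vgg_features = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(y_hat, y):
    # MSE between VGG feature maps of the reconstructed and reference images
    return F.mse_loss(vgg_features(y_hat), vgg_features(y))

y_hat = torch.rand(1, 3, 224, 224)  # reconstructed image (random stand-in)
y = torch.rand(1, 3, 224, 224)      # reference image (random stand-in)
print(perceptual_loss(y_hat, y))

In practice the inputs are normalized with the ImageNet statistics before being fed to VGG, and features from several layers may be combined.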

GoogLeNet¶

The GoogLeNet model won the ImageNet challenge in 2014 by using a multi-branch network. It also follows the now-common design of a low-level feature extractor (the first 2-3 layers), a main data-processing body, and a prediction head.

Inception block¶

The inception block has the following form

Training very deep network¶

How can we increase the depth of the network without running into trouble?

The answer is surprisingly simple and is given by the residual block.

Instead of the mapping $x := f(x)$ we learn a mapping $x := x + f(x)$, i.e. we try to learn a correction to the previous features.

Residual block¶

ResNet follows VGG design, but with residual blocks of the following form.
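A minimal sketch of such a block in PyTorch (simplified: no batch normalization, and the number of channels is kept fixed, so no 1x1 projection is needed on the skip path):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        # learn a correction f(x) and add it to the input: x -> x + f(x)
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # the shape is preserved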

ResNet: architecture¶

The ResNet architecture is similar to GoogLeNet, but simpler, and it led to the wide adoption of such architectures.

Family of ResNet models¶

Several modifications of the basic ResNet have been proposed:

  • ResNeXt: within the block, information flows through several groups and is then aggregated (similar to the Inception block).
  • WideResNet: uses a wider residual block, shown to be superior in several settings.

Typical properties of pretrained ResNet models¶

Most popular (pretrained) models:

Model Top-1 Error (%) Top-5 Error (%) Number of Parameters
ResNet-18 30.43 10.76 11,689,512
ResNet-34 26.73 8.74 21,797,672
ResNet-50 24.01 7.02 25,557,032
ResNet-101 22.44 6.21 44,549,160

Factorization of convolutions¶

Consider the convolutional layer:

$$V(x, y, t) = \sum_{i=x-\delta}^{x+\delta} \sum_{j=y-\delta}^{y+\delta} \sum_{s=1}^S K(i - x + \delta, j - y + \delta, s, t) U(i, j, s)$$

Let us compute the following decomposition (called the CP-decomposition):

$$ K(i,j,s,t) = \sum_{r=1}^{R} K_x(i,r)\, K_y(j,r)\, K_s(s,r)\, K_t(t,r) $$

Then the convolution is represented as a $1 \times 1$ convolution (which is just a summation over the channel dimension) followed by convolutions of smaller size!
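A minimal sketch of the resulting sequence of layers (channel counts, kernel size and rank are illustrative): a 1x1 convolution into R channels, two depthwise convolutions with kernels k x 1 and 1 x k, and a final 1x1 convolution into T channels.

import torch
import torch.nn as nn

S, T, k, R = 64, 128, 5, 16  # input channels, output channels, kernel size, CP rank

factorized_conv = nn.Sequential(
    nn.Conv2d(S, R, kernel_size=1),                                      # K_s: mix input channels
    nn.Conv2d(R, R, kernel_size=(k, 1), padding=(k // 2, 0), groups=R),  # K_x: depthwise, vertical
    nn.Conv2d(R, R, kernel_size=(1, k), padding=(0, k // 2), groups=R),  # K_y: depthwise, horizontal
    nn.Conv2d(R, T, kernel_size=1),                                      # K_t: mix into output channels
)

x = torch.randn(1, S, 32, 32)
print(factorized_conv(x).shape)  # torch.Size([1, 128, 32, 32])

full = sum(p.numel() for p in nn.Conv2d(S, T, k, padding=k // 2).parameters())
fact = sum(p.numel() for p in factorized_conv.parameters())
print(full, fact)  # the factorized version has far fewer parameters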

MobileNet¶

In 2017, the MobileNet architecture was proposed.

The idea is to use $1 \times 1$ convolutions plus depthwise separable convolutions (i.e., the spatial convolution is applied to each channel separately).

What is the complexity of such transformation?

The first architecture with depthwise separable convolutions was proposed in 2016 by François Chollet (one of the authors of Keras), who used them as a replacement for the Inception block (the Xception architecture).
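A minimal sketch comparing a standard convolution with its depthwise separable counterpart (channel counts are illustrative), which also makes the complexity question above easy to check numerically:

import torch
import torch.nn as nn

C_in, C_out, k = 64, 128, 3

standard = nn.Conv2d(C_in, C_out, k, padding=1)
separable = nn.Sequential(
    nn.Conv2d(C_in, C_in, k, padding=1, groups=C_in),  # depthwise: one k x k filter per channel
    nn.Conv2d(C_in, C_out, kernel_size=1),             # pointwise: 1 x 1 convolution mixes channels
)

x = torch.randn(1, C_in, 32, 32)
print(standard(x).shape, separable(x).shape)  # same output shape

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # roughly C_in*C_out*k^2 vs C_in*k^2 + C_in*C_out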

MobileNetV1: architecture¶

Left: a VGG-type block, Right: a MobileNet block. The architecture has two hyperparameters: a width multiplier (how many channels in the output) and a resolution multiplier (how much the input is downsampled).

Also, there are no pooling blocks; instead, convolutions with stride 2 are used to reduce the spatial dimensions.

MobileNetV2¶

Note that one block has a residual connection and the other does not. Important: the first $1 \times 1$ convolution increases the number of channels, and the last one decreases it. The same expansion pattern is used in transformers.
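A minimal sketch of such an inverted residual block (simplified: no batch normalization, stride 1, and the expansion factor is illustrative; the skip connection is only valid when input and output shapes match):

import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, channels, expansion=6):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1), nn.ReLU6(),              # 1x1: expand channels
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),  # depthwise 3x3
            nn.ReLU6(),
            nn.Conv2d(hidden, channels, kernel_size=1),                          # 1x1: project back down (no activation)
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection

print(InvertedResidual(32)(torch.randn(1, 32, 56, 56)).shape)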

Some numbers from 2018¶

Network Architecture Number of Parameters Top-1 Accuracy Top-5 Accuracy
Xception 22.91M 0.790 0.945
VGG16 138.35M 0.715 0.901
MobileNetV1 (alpha=1, rho=1) 4.20M 0.709 0.899
MobileNetV1 (alpha=0.75, rho=0.85) 2.59M 0.672 0.873
MobileNetV1 (alpha=0.25, rho=0.57) 0.47M 0.415 0.663
MobileNetV2 (alpha=1.4, rho=1) 6.06M 0.750 0.925
MobileNetV2 (alpha=1, rho=1) 3.47M 0.718 0.910
MobileNetV2 (alpha=0.35, rho=0.43) 1.66M 0.455 0.704

EfficientNet¶

Mingxing Tan, Quoc V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks



The idea of scaling CNN architectures is very important. ConvNet layers are often partitioned into multiple stages, and all layers within a stage share the same architecture: for example, ResNet (He et al., 2016) has five stages, and all layers in each stage have the same convolutional type, except that the first layer of a stage performs down-sampling.

Therefore, we can define a ConvNet as:

$\mathcal{N} = \bigodot_{i=1 \ldots s} \mathcal{F}^{L_i}_i (X_{\langle H_i, W_i, C_i \rangle}),$

i.e. we have a layer $\mathcal{F}_i$ applied $L_i$ times in stage $i$. The idea of the EfficientNet design is not to optimize $\mathcal{F}_i$, but to choose the best $H_i, W_i, C_i$ and $L_i$.

Different types of scaling¶

Typical scaling of ConvNet architectures:

  • Depth (but ResNet-1001 has accuracy similar to ResNet-101)
  • Width (WideResNets)
  • Resolution (higher-resolution input images).

Compound scaling in EfficientNet¶

\begin{equation*} \text{depth: } d = \alpha^\phi,\quad \text{width: } w = \beta^\phi,\quad \text{resolution: } r = \gamma^\phi, \end{equation*}

\begin{equation*} \text{s.t. } \alpha \cdot \beta^2 \cdot \gamma^2 \approx 2,\quad \alpha \geq 1,\ \beta \geq 1,\ \gamma \geq 1. \end{equation*}

The parameters $\alpha$, $\beta$, $\gamma$ are determined by a grid search.
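A small numerical illustration (the coefficient values below are approximately those reported in the paper):

alpha, beta, gamma = 1.2, 1.1, 1.15  # found by grid search with phi fixed to 1
print(alpha * beta**2 * gamma**2)    # ~1.92, close to the constraint value of 2

phi = 3                              # larger phi gives a larger model in the family
depth, width, resolution = alpha**phi, beta**phi, gamma**phi
print(depth, width, resolution)      # factors by which to scale baseline depth / width / resolution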

The baseline model (EfficientNet-B0) for the block $\mathcal{F}_i$ is determined by Neural Architecture Search.

This idea of scaling basic blocks later emerged in transformers.

Other CNNs¶

  • MobileNet (the architecture with depthwise separable convolutions; later versions use automatic architecture search)
  • DenseNet (every layer is connected to all preceding ones).

Convolutions for other types of data¶

We can have 1D convolutions (sequences, audio signals).

3D ConvNets: for video data.

Conceptually, these are not very different from the 2D case (the main difference is the lack of ImageNet-scale datasets).
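A minimal sketch: the same building block exists in 1D and 3D, only the shape of the sliding window changes (the input shapes below are illustrative):

import torch
import torch.nn as nn

audio = torch.randn(1, 1, 16000)         # (batch, channels, time), e.g. one second of audio
video = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)

conv1d = nn.Conv1d(1, 16, kernel_size=9, padding=4)
conv3d = nn.Conv3d(3, 16, kernel_size=3, padding=1)

print(conv1d(audio).shape)  # torch.Size([1, 16, 16000])
print(conv3d(video).shape)  # torch.Size([1, 16, 16, 112, 112])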

ConvNets for the 2020s¶

In 2020, Vision Transformers (ViT) outperformed CNNs on many tasks. We will discuss ViT later. However, ConvNets can be modified to work on par with ViT using several macro/micro design choices in the architecture.

ConvNeXt: 'A ConvNet for the 2020s' paper

Key ideas:

  • Changing the stem (the first block) to a 'patchify' layer.
  • Large kernel size (!!!!) --- moving to $7 \times 7$.
  • Inverted bottleneck
  • Several microdesign steps.

Important ingredients for better / stable training¶

  • BatchNorm: avoid gradient explosion
  • Dropout: reduce train/test gap.

They will be covered in more detail in the next lecture.

Take home message¶

  • The evolution of CNN architectures: from LeNet and AlexNet to VGG, ResNet, MobileNet, EfficientNet and ConvNeXt.

Next lecture: Training better models¶

  • SGD optimization methods (SGD with momentum, Adam, ...)
  • Problems with training deep models: vanishing gradients, catastrophic forgetting
  • Effect of initializations
  • Normalizations
  • Early stopping
  • Interesting properties of loss surfaces of DNN models