The most popular computer vision (CV) task is classification.
We already discussed the supervised learning setting and the loss functions (a softmax is used to predict class probabilities),
and such models are trained on large datasets using SGD-type methods
(with quite a lot of tricks, which we will discuss later, both to improve generalization and computational efficiency).
Basic architectures were also discussed before (CNN, ResNet).
Vision Transformer models will be discussed later in this course.
Typically, we talk about accuracy (the percentage of correct predictions). It works well when classes are balanced (i.e. have a similar number of samples).
Accuracy: $$A = \frac{TP+TN}{TP+FP+TN+FN}.$$ Precision: $$P = \frac{TP}{TP+FP}.$$ Recall: $$R = \frac{TP}{TP+FN}.$$ F1 score (the harmonic mean of precision and recall): $$F_1 = \frac{2 P R}{P + R}.$$
Finally, the ROC (Receiver Operating Characteristic) curve is a plot of the True Positive Rate (TPR = Recall) against the False Positive Rate (FPR),
$FPR = \frac{FP}{TN+FP}.$
The curve is traced by varying the decision threshold between the positive and negative class. AUC is the area under this curve; the larger, the better.
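A minimal sketch of these metrics with scikit-learn, on toy labels and scores invented for illustration:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true  = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth labels
y_score = np.array([0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3])   # predicted P(class = 1)
y_pred  = (y_score > 0.5).astype(int)                           # threshold at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))    # (TP + TN) / all
print("Precision:", precision_score(y_true, y_pred))   # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))      # TP / (TP + FN)
print("F1       :", f1_score(y_true, y_pred))          # 2PR / (P + R)
print("ROC AUC  :", roc_auc_score(y_true, y_score))    # area under the TPR-vs-FPR curve
```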
The goal of object detection is (!) to detect objects in images. The detection includes predicting a bounding box and a class label for each object.
There are quite a lot of datasets for object detection, see paperswithcode
The quality of the predicted boxes is measured with the IoU (Intersection over Union) metric.
The quality of the predicted classes is measured with the mean average precision (mAP) metric.
The reason is that one image can contain objects of different classes, and different IoU thresholds have to be used.
In order to compute average precision (for object detection), we build the precision-recall curve at a fixed IoU threshold $t$: a detection with $IoU > t$ counts as positive, otherwise it counts as negative.
Once precision and recall values are computed, we compute
$$AP = \sum_{k=0}^{n-1} [R(k)-R(k+1)] P(k),$$ which is just the area under the precision-recall curve. AP is always between 0 and 1 (check!).
In object detection we have different classes, and they can use different IoU thresholds!
We simply average AP over the classes, and that's mAP!
This is a standard metric for benchmarking object detection.
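A minimal sketch of both quantities, assuming boxes in $(x_1, y_1, x_2, y_2)$ format and recall values sorted in decreasing order, as in the AP sum above (the sample boxes and precision-recall values are invented):

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def average_precision(recall, precision):
    """AP = sum_k [R(k) - R(k+1)] P(k), recall given in decreasing order."""
    r = np.concatenate([recall, [0.0]])        # close the curve with R(n) = 0
    return float(np.sum((r[:-1] - r[1:]) * precision))

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))                                  # -> 1/7
print(average_precision(np.array([1.0, 0.5]), np.array([0.5, 1.0])))    # -> 0.75
```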
There is a bunch of successful classical detectors (Viola-Jones, SIFT/HOG-based), used mainly for face detection.
The first successful CNN architecture was R-CNN.
R-CNN stands for Region-based CNN (regions with CNN features).
YOLO (You Only Look Once) model family
Modern: transformer-based models
The original R-CNN has the flavour of classical methods.
First, the authors used selective search (which looks at things like histograms to measure the similarity between regions) to select a large number (about 2000) of candidate regions.
Then, these regions are warped to a square and fed into a CNN, which produces a 4096-dimensional feature vector.
The extracted features are fed into an SVM to classify whether the object is present in the region or not.
The algorithm also predicts four offset values to refine the object bounding box.
In the Fast R-CNN model, the whole image is passed through the CNN only once: features for each region proposal are extracted with RoI pooling, and classification and box regression are done by the network itself instead of separate SVMs.
Faster R-CNN went further and replaced selective search with a learned region proposal network; it has become much faster (more than 100x) and can be used for (near) real-time detection.
You can also easily swap in more modern backbone architectures!
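As a minimal sketch, a pre-trained Faster R-CNN with a ResNet-50 FPN backbone can be run through torchvision (the image file name is a placeholder):

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# pre-trained detector (older torchvision versions use pretrained=True instead of weights=)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("street.jpg").convert("RGB"))   # (3, H, W), values in [0, 1]
with torch.no_grad():
    out = model([img])[0]            # dict with 'boxes', 'labels', 'scores'

keep = out["scores"] > 0.5           # keep confident detections only
print(out["boxes"][keep], out["labels"][keep])
```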
In R-CNN models, the network first looks for regions where an object can be located and then works with those regions.
In YOLO (You Only Look Once), a single CNN predicts the bounding boxes and classes simultaneously.
The input image is split into an $S \times S$ grid.
For each cell we predict $B$ bounding boxes. Each box is $(x, y, w, h, \text{confidence})$. The grid prediction is encoded in the size of the output tensor, which is $S \times S \times (5 B + C)$.
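A toy sketch of how such an output tensor can be unpacked (the random tensor stands in for a real network output; this only illustrates the shapes, it is not YOLO code):

```python
import torch

S, B, C = 7, 2, 20                        # grid size, boxes per cell, classes (YOLOv1 values)
out = torch.randn(S, S, 5 * B + C)        # raw network output for one image

boxes = out[..., : 5 * B].reshape(S, S, B, 5)      # (x, y, w, h, confidence) per box
class_probs = out[..., 5 * B :].softmax(dim=-1)    # C class probabilities per cell

# e.g. the confidence of box 0 in the top-left grid cell:
print(boxes[0, 0, 0, 4].item(), class_probs.shape)  # scalar, torch.Size([7, 7, 20])
```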
In YOLOv2 several improvements have been made (batch normalization, anchor boxes, a higher-resolution classifier, multi-scale training):
it became faster and more accurate!
As a backbone, you can also use vision transformers (we will discuss them if time permits and I do not forget to do it).
There are several types of image segmentation.
These include:
In semantic segmentation, we need to predict the class of each pixel (i.e. all trees, all persons, etc.). This is an example of an image-to-image transformation: the input is an image, the output is a class mask.
In instance segmentation, we need to predict not only the class, but also which instance of the object each pixel belongs to.
This is similar to object detection, but instead of a bounding box we predict a pixel-level mask for each object instance.
Panoptic segmentation combines the best of both worlds.
Each pixel in a scene is assigned a semantic label
(as in semantic segmentation) and a unique instance identifier (as in instance segmentation).
There are quite a few architectures for semantic segmentation including:
SegNet was one of the first such architectures and has the following image-to-image (pix2pix-style) form
U-Net is still one of the most popular architectures for image-to-image models and is used in diffusion models.
The key idea is to add skip connections to ensure multiscale processing of the input data.
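A toy sketch of this idea (a two-level "U-Net" with arbitrary layer widths, not the original architecture): the decoder concatenates upsampled features with encoder features of the same resolution.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.enc  = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up   = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec  = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(16, num_classes, 1))

    def forward(self, x):
        skip = self.enc(x)                      # high-resolution features
        bottom = self.down(skip)                # low-resolution features
        up = self.up(bottom)                    # back to the skip resolution
        return self.dec(torch.cat([up, skip], dim=1))   # skip connection by concatenation

mask_logits = TinyUNet()(torch.randn(1, 3, 64, 64))     # (1, num_classes, 64, 64)
```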
Several backbones (DeepLab v1, v2, v3) use dilated (atrous) convolutions to build the encoder, for example https://arxiv.org/pdf/1706.05587v3.pdf
These convolutions replace pooling, which is good for extracting abstract features but loses the resolution needed for pixel-level predictions.
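A quick check in PyTorch: a 3x3 convolution with dilation 2 covers a 5x5 window while keeping the spatial resolution (shapes are toy values).

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)
conv = nn.Conv2d(8, 8, kernel_size=3, padding=2, dilation=2)   # padding = dilation keeps H, W
print(conv(x).shape)   # torch.Size([1, 8, 32, 32]) -- same size, larger receptive field
```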
FastFCN, a fast fully convolutional network based on DeepLabv3, is used to replace dilated convolutions (which take a lot of memory and time). The original network uses upsampling with dilated convolutions; the fast version takes the last three convolutional feature maps and fuses them with Joint Pyramid Upsampling.
There are quite a lot of different losses for segmentation problems, see the review:
Type | Loss Function |
---|---|
Distribution-based Loss | Binary Cross-Entropy |
 | Weighted Cross-Entropy |
 | Balanced Cross-Entropy |
 | Focal Loss |
 | Distance map derived loss penalty term |
Region-based Loss | Dice Loss |
 | Sensitivity-Specificity Loss |
 | Tversky Loss |
 | Focal Tversky Loss |
 | Log-Cosh Dice Loss |
Boundary-based Loss | Hausdorff Distance Loss |
 | Shape-aware Loss |
Compounded Loss | Combo Loss |
 | Exponential Logarithmic Loss |
Since segmentation can be viewed as per-pixel classification, the distribution-based losses (including the focal loss) carry over directly.
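For instance, the standard PyTorch cross-entropy loss already works per pixel when the prediction has shape $(N, C, H, W)$ and the target stores a class index per pixel (the shapes below are toy values):

```python
import torch
import torch.nn as nn

N, C, H, W = 2, 4, 64, 64
logits = torch.randn(N, C, H, W)             # network output: one score per class per pixel
target = torch.randint(0, C, (N, H, W))      # ground-truth class index per pixel

loss = nn.CrossEntropyLoss()(logits, target) # averaged over all pixels
print(loss.item())
```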
One can also use a differentiable version of the Dice loss, defined as
$$DL(y, \hat{y}) = 1 - \frac{2 y \hat{y} + 1}{y + \hat{y} + 1}.$$ It is claimed to work better for imbalanced datasets (typical for medical data).
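A minimal differentiable soft Dice loss in the spirit of this formula, summing over all pixels and using the same $+1$ smoothing (the sample mask and probabilities are invented):

```python
import torch

def dice_loss(y_true, y_prob, smooth=1.0):
    """y_true: binary mask, y_prob: predicted probabilities, same shape."""
    y_true = y_true.flatten().float()
    y_prob = y_prob.flatten()
    inter = (y_true * y_prob).sum()
    return 1.0 - (2.0 * inter + smooth) / (y_true.sum() + y_prob.sum() + smooth)

y_true = torch.tensor([[1, 0], [1, 1]])
y_prob = torch.tensor([[0.9, 0.1], [0.8, 0.6]])
print(dice_loss(y_true, y_prob).item())   # small value -> good overlap
```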
Many other losses and a lot of engineering!
Now let's discuss instance segmentation. Some of the popular architectures include:
Mask R-CNN (Detectron by Facebook, now Detectron2).
The difference between Detectron and Detectron2 is that the latter is written in PyTorch and includes panoptic segmentation and DensePose.
The original architecture is an extension of the Faster R-CNN described before.
The model generates the bounding box and segmentation for each instance.
An additional head predicts the mask!
The total loss will be a sum of three losses (class, box, mask).
RoI Align solves the mismatch between the feature-map grid and the region coordinates by using (bilinear) interpolation: a region of the feature map has to be pooled into a fixed-size grid.
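A minimal sketch with torchvision.ops.roi_align, assuming a backbone with stride 16 so that image coordinates map to the feature map via spatial_scale = 1/16 (the feature map and the box are toy values):

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 256, 50, 50)                 # feature map, stride 16 w.r.t. the image
boxes = [torch.tensor([[32.0, 48.0, 256.0, 320.0]])]   # one (x1, y1, x2, y2) box in image coords
pooled = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1 / 16, aligned=True)
print(pooled.shape)   # torch.Size([1, 256, 7, 7]) -- fixed-size features for any box
```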
The PANet architecture studies a practical case of few-shot image segmentation.
We will talk about zero-shot/few-shot architectures later on in the course.
The idea is that we only have a few examples per class.
Details later on! (Remind me if I forget about it, since it is useful to explain few-shot classification before few-shot segmentation!).