An introductory textbook on computer vision using deep learning, assuming no prior knowledge of deep learning; thus, the first half of the book covers the basic concepts of neural network architecture and training setups. The second half is more advanced, focusing specifically on computer vision, and covers topics such as vision CNN architectures, object detection, and visual embeddings. Each chapter includes example code, showing how to train a toy project using Keras.
As someone with a background in natural language processing, I found the amount of introductory material slightly disappointing, but it could be useful for someone just starting out. The chapters in the second half, which are specific to computer vision, were more interesting to me, although they cover only a handful of the many different CV tasks and concepts. Published in 2020, the book does not include some of the newer innovations like Vision Transformers and diffusion models for image generation. Its chapter on generative models focuses on GANs, which were more popular at the time.
Chapter 1: Basic tasks of computer vision include classification, object detection, generating images, face recognition, etc.; the traditional pipeline is as follows. First, the image must be resized to a standard size, followed by pre-processing, such as converting it to grayscale or adjusting colors depending on the use case. Then comes feature extraction; traditionally, this involved handcrafted algorithms to extract features from images, but nowadays, feature extraction is learned using deep learning.
Chapter 2: Basics of deep learning: a single-layer perceptron can only fit a linear function, so a Multi-Layer Perceptron (MLP) with hidden layers is necessary to learn more complex functions. The chapter also reviews activation functions, loss functions, stochastic gradient descent, and backpropagation.
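The limitation of a single-layer perceptron can be illustrated with XOR, the classic function that is not linearly separable. The sketch below is not from the book: the step activation, the weight grid, and the hand-picked MLP weights are all assumptions chosen for illustration. It searches a grid of perceptron weights without finding any that reproduce XOR, while a tiny two-layer network does.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z):
    """Threshold activation: 1 if the input is positive, else 0."""
    return 1 if z > 0 else 0

def perceptron(x, w1, w2, b):
    """Single-layer perceptron: one linear boundary followed by a step."""
    return step(w1 * x[0] + w2 * x[1] + b)

def mlp(x):
    """Two-layer network with hand-picked (hypothetical) weights."""
    h_or = step(x[0] + x[1] - 0.5)    # hidden unit 1 behaves like OR
    h_and = step(x[0] + x[1] - 1.5)   # hidden unit 2 behaves like AND
    return step(h_or - h_and - 0.5)   # output: OR and not AND

# Try every weight combination on a coarse grid from -4.0 to 4.0.
grid = [i / 2 for i in range(-8, 9)]
single_layer_solutions = [
    (w1, w2, b)
    for w1, w2, b in product(grid, repeat=3)
    if all(perceptron(x, w1, w2, b) == y for x, y in XOR.items())
]
mlp_correct = all(mlp(x) == y for x, y in XOR.items())

print(len(single_layer_solutions), mlp_correct)
```

The grid search is only a demonstration, not a proof; the general result is that no single linear boundary separates the XOR classes.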
Chapter 3: CNNs are useful because MLPs treat an image as a flattened vector of pixels, making it challenging for them to learn when an object appears in different locations in the image; essentially, they lack spatial knowledge of the input. In contrast, convolutional layers learn the feature extraction part of the traditional computer vision pipeline.
A convolution layer is defined by its number of filters (kernels), roughly equivalent to the number of hidden units in a layer, and a kernel size, which is typically square and usually ranges from 2×2 to 5×5. The stride parameter allows skipping of pixels; with padding, a stride of 1 produces an output of roughly the same size as the input, while a stride of 2 results in an output about half the original size in each dimension; strides larger than two are uncommon. Typically, zero padding is applied around the border of the image to correctly handle the edges.
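The effect of stride and padding on output size follows the standard formula floor((n + 2p - k) / s) + 1 along each dimension. The helper below is a small sketch of that arithmetic; the function name and the 32-pixel example are assumptions, not taken from the book.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# A 32-pixel-wide input with a 3x3 kernel.
same = conv_output_size(32, kernel_size=3, stride=1, padding=1)    # 32
halved = conv_output_size(32, kernel_size=3, stride=2, padding=1)  # 16
no_pad = conv_output_size(32, kernel_size=3, stride=1, padding=0)  # 30

print(same, halved, no_pad)
```

Without padding, every convolution trims the border, which is one reason zero padding is the usual default.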
The pooling layer is there to downsample the image, using either max or average pooling to reduce dimensions; it’s usually placed after a convolution layer. Throughout the feature extraction process, the image tends to become smaller but deeper. The output of the final convolutional layer is flattened, then fed into a series of fully connected layers. The parameter count of a convolutional layer depends on the kernel size, number of filters, and depth of the previous layer, but not the size of the image; the pooling layer does not have parameters. The chapter concludes with an end-to-end exercise in training AlexNet using Keras on the CIFAR-10 dataset.
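The parameter counts described above can be checked with a little arithmetic. This is a sketch with hypothetical layer sizes (a 3×3 convolution on an RGB input, then a small classifier); each filter is assumed to carry one bias term.

```python
def conv_params(kernel_size, in_depth, filters):
    """Weights plus one bias per filter; independent of image height/width."""
    return (kernel_size * kernel_size * in_depth + 1) * filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit of a fully connected layer."""
    return (in_units + 1) * out_units

POOL_PARAMS = 0  # pooling only selects or averages values, nothing is learned

first_conv = conv_params(kernel_size=3, in_depth=3, filters=32)     # 896
second_conv = conv_params(kernel_size=3, in_depth=32, filters=64)   # 18496
# After flattening a hypothetical 8x8x64 feature map into 4096 values:
classifier = dense_params(in_units=8 * 8 * 64, out_units=10)        # 40970

print(first_conv, second_conv, classifier)
```

Note that the first fully connected layer after flattening is the one place where the image size does matter, since its input width is the flattened feature map.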
Chapter 4: how to do train-test splits, set up evaluation metrics, and interpret training and validation curves. Overview of all the common hyperparameters used during training, such as learning rate, batch size, epochs, early stopping, dropout, regularization, and batch normalization. The chapter also includes several modifications to the previous chapter’s model to improve its accuracy.
Chapter 5 covers the differences between some popular CNN architectures. LeNet-5, proposed by Yann LeCun in 1998 for MNIST digit classification, had only 60k parameters but already contained many of the fundamental building blocks, such as convolution layers, pooling, and fully connected layers; it used the tanh activation function instead of ReLU.
AlexNet, with its 60M parameters, won the ImageNet competition in 2012. Its main innovations included dropout, ReLU activation, weight regularization, and multi-GPU training.
VGGNet, developed at Oxford in 2014, is commonly used in two configurations: VGG16, which is the more popular, and VGG19; both have roughly 140 million parameters. It is deeper than AlexNet and stacks several convolution layers before each pooling layer.
The inception module and GoogLeNet, which won the ImageNet competition in 2014, introduced the idea of having several branches of convolution blocks with different kernel sizes to learn various filters, which are then concatenated at the end of the block. For larger convolution layers, which are expensive, a bottleneck layer (or 1×1 convolution) is used to reduce depth (e.g., from 200 to 16) before applying the convolution layer. GoogLeNet starts with a base similar to AlexNet, followed by several chains of inception modules, and concludes with a fully connected layer; it is more efficient than VGG and achieves higher accuracy.
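To see why the bottleneck helps, the weight counts can be compared directly. The sketch below uses the 200-to-16 depth reduction mentioned above; the 5×5 kernel and the 32 output filters are hypothetical choices for illustration, and biases are ignored.

```python
def conv_weights(kernel_size, in_depth, filters):
    """Number of weights in a convolution layer, ignoring biases."""
    return kernel_size * kernel_size * in_depth * filters

IN_DEPTH = 200
BOTTLENECK_DEPTH = 16
OUT_FILTERS = 32  # hypothetical output depth, not from the book

# 5x5 convolution applied directly to the 200-channel input.
direct = conv_weights(5, IN_DEPTH, OUT_FILTERS)

# 1x1 bottleneck down to 16 channels, then the same 5x5 convolution.
with_bottleneck = (
    conv_weights(1, IN_DEPTH, BOTTLENECK_DEPTH)
    + conv_weights(5, BOTTLENECK_DEPTH, OUT_FILTERS)
)

print(direct, with_bottleneck, direct / with_bottleneck)
```

Under these assumptions the bottleneck version needs a tenth of the weights, and the number of multiplications shrinks by the same factor because both variants are applied at every spatial position.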
ResNet, the winner of the 2015 ImageNet challenge, uses residual connections to create a shortcut path that bypasses a series of convolution blocks. This design addresses the vanishing gradient problem and enables the training of much deeper networks, up to 152 layers. The shortcut block includes a dimension reduction layer to ensure that its dimensions match the output of the convolution blocks, allowing them to be added together.
Chapter 6. Transfer learning is useful because in most cases, you won’t start with a network from scratch but will leverage a pre-trained network and fine-tune it for your task. For example, you can freeze the feature extractor layers of VGG16 and train the FC layers to classify cats versus dogs. The procedure depends on whether your task is similar to the one the network was originally trained on and whether you have a small or large dataset. If the dataset is similar, you should freeze the feature extraction and only retrain the classifier part. However, if the dataset is very different, you may want to unfreeze everything, but it’s still beneficial to initialize from a pre-trained network rather than starting randomly. Another approach is something in between, like freezing the earlier layers of the feature extractor.
Chapter 7 – Object Detection. Generally, object detection involves several steps: first, region proposal, where a model generates numerous regions of interest (RoIs); then, features are extracted from each region using some type of CNN and used to predict a class and a bounding box. As there will be many bounding boxes for the same object, the final step is non-maximum suppression (NMS), which discards anything below a confidence threshold and then retains only one box for each object, the one with the highest probability. Common evaluation metrics include Intersection over Union (IoU), for bounding box accuracy, and mean Average Precision (mAP), which summarizes detection accuracy across classes.
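IoU and NMS are simple enough to sketch in a few lines. The version below is a minimal greedy NMS, assuming boxes are given as (x1, y1, x2, y2) corners; the thresholds and example boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.3):
    """Greedy NMS: returns indices of the boxes that survive."""
    # Drop low-confidence boxes, then visit the rest from best to worst.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box.
        if all(iou(boxes[i], boxes[j]) < iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (11, 9, 49, 51)]
scores = [0.9, 0.8, 0.75, 0.2]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

Box 1 is suppressed because it overlaps the higher-scoring box 0, and box 3 never enters the comparison because it falls below the confidence threshold. In practice NMS is usually run separately per class.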
The most basic model is R-CNN, which consists of several models that are trained separately and then combined for object detection. Initially, it uses a selective search algorithm to identify blobs of similar colors and merge regions. Each of these regions is then fed into a CNN for feature extraction, and subsequently, the features are fed into a classifier and a bounding box regressor. R-CNN is very slow because it must run the CNN separately on thousands of region proposals.
Next is Fast R-CNN, which differs in that it applies the CNN once on the entire image rather than once per region. Regions are still proposed by the selective search algorithm, and each proposal is projected onto the CNN feature map to extract its features (RoI pooling). It utilizes a multitask loss for both classification and bounding box regression within a single model; however, selective search remains a separate step that is not learned with the network.
Faster R-CNN does the entire detection process end-to-end within one network. It uses a Region Proposal Network (RPN) to propose regions instead of using selective search. The RPN slides a window over the CNN feature map, generating k anchor boxes at each location. For each anchor box, it predicts whether the box contains an object and treats the bounding box as a regression problem: it predicts a delta for the x and y coordinates of the center relative to the anchor, together with an adjustment to the width and height of the anchor. Then, the FC layers take the regions proposed by the RPN and predict the class, as well as refining the bounding box. This method maintains similar accuracy to the original R-CNN but is approximately 250 times faster.
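The anchor-plus-delta idea can be made concrete with the box parameterization used in the Faster R-CNN paper: center offsets are scaled by the anchor size, and width and height are adjusted in log space. The sketch below assumes boxes in (center x, center y, width, height) form; the scales, aspect ratios, and example deltas are hypothetical.

```python
import math

def make_anchors(cx, cy, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) anchors centred at (cx, cy)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:  # ratio is width / height
            w = scale * math.sqrt(ratio)
            h = scale / math.sqrt(ratio)
            anchors.append((cx, cy, w, h))
    return anchors

def decode_anchor(anchor, deltas):
    """Apply predicted deltas (tx, ty, tw, th) to an anchor box."""
    xa, ya, wa, ha = anchor
    tx, ty, tw, th = deltas
    cx = xa + tx * wa        # shift the centre by a fraction of the anchor size
    cy = ya + ty * ha
    w = wa * math.exp(tw)    # scale width and height multiplicatively
    h = ha * math.exp(th)
    return cx, cy, w, h

anchors = make_anchors(100, 100)
# Hypothetical network output for the square 64x64 anchor.
box = decode_anchor((100, 100, 64, 64), (0.1, -0.2, 0.0, math.log(2)))
print(len(anchors), box)
```

Predicting deltas relative to an anchor keeps the regression targets small and roughly scale-independent, which is easier to learn than raw pixel coordinates.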
The R-CNN models are two-stage models, but the fastest models are single-stage, one of which is the Single Shot Detector (SSD). This model uses multiple classifiers at different stages of the CNN feature extractor network to predict bounding boxes at various scales. The earlier layers are responsible for predicting smaller objects, while the later layers handle larger objects. However, the accuracy for smaller objects tends to be lower.
“You Only Look Once” (YOLO) is another fast object detection algorithm. YOLO divides the image into a grid of cells, where each cell directly predicts both a bounding box and object classification. To enable predictions at different scales, it uses several grids with varying cell sizes. The chapter concludes with an example of training an SSD model on an object detection dataset.
Chapter 8. Generative Adversarial Networks (GANs) consist of two parts: the discriminator, a typical classification CNN model, and the generator, a kind of reverse CNN that transforms random noise into an image. These two components are jointly trained; training the discriminator is straightforward, but training the generator requires you to construct a combined model. During this process, the discriminator’s weights are fixed while only the generator model is trained. There are several ways to evaluate the output: Inception Score uses a pre-trained network like Inception to evaluate and determine if the correct classes are generated; Fréchet Inception Distance (FID) is an improvement that compares feature statistics of generated and real images, and it requires a larger sample size. GANs are useful for a wide range of image generation applications, such as text-to-image or image-to-image translation. The chapter shows an example project of training a DCGAN on Fashion-MNIST.
Chapter 9 explores visualization techniques to understand what activates specific neurons in a neural network. One such technique is gradient ascent, where the network remains fixed while the image is optimized to maximize the activation of a particular neuron. Deep Dream is an artistic tool that starts with an input image and transforms it into a trippy picture by maximizing the activations of an entire layer of neurons. This process involves several iterative steps of upscaling and detail injection, as the network is not adept at generating very large images.
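The gradient ascent idea can be shown on a toy scale. In the sketch below the "network" is a single fixed, hypothetical neuron over a three-pixel image, and the gradient is estimated by finite differences; a real implementation would use the framework's automatic differentiation on a trained CNN.

```python
def neuron(x):
    """Hypothetical fixed neuron: likes a bright centre and dark edges."""
    weights = [-1.0, 2.0, -1.0]
    return sum(w * xi for w, xi in zip(weights, x))

def numerical_grad(f, x, eps=1e-4):
    """Central-difference estimate of df/dx for each input pixel."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad

def gradient_ascent(f, x, steps=50, lr=0.1):
    """Update the image (not the weights) to increase the activation."""
    x = list(x)
    for _ in range(steps):
        grad = numerical_grad(f, x)
        # Step uphill, then clip pixels to the valid [0, 1] range.
        x = [min(1.0, max(0.0, xi + lr * gi)) for xi, gi in zip(x, grad)]
    return x

start = [0.5, 0.5, 0.5]
image = gradient_ascent(neuron, start)
print(image, neuron(start), neuron(image))
```

The optimized image ends up bright where the neuron's weights are positive and dark where they are negative, which is exactly the pattern the neuron responds to.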
Neural Style Transfer involves transforming the style of an image while preserving its content. It uses a content loss, based on a higher layer of the network, to minimize the MSE with the original image; neurons in these higher layers detect the content of each image region and penalize deviations from the original. The style loss is more complex, as it needs to match the activations across all layers, both lower and higher, without capturing the global arrangement, which would constitute the content. This is achieved using a Gram matrix to represent the distributions of activations in layers without conveying spatial information; the image is then modified to match the style image’s Gram matrix. Additionally, a total variation loss is used to enforce smoothness throughout the image.
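A Gram matrix is just the table of inner products between channel activations, which is why it discards spatial layout. The sketch below assumes each layer's activations are given as a list of channels, each flattened to a list of spatial values; the normalization by the number of positions is one common convention, not necessarily the one used in the book.

```python
def gram_matrix(features):
    """features: list of C channels, each a flat list of N activations."""
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(ci, cj)) / n for cj in features]
        for ci in features
    ]

def style_loss(generated, style):
    """Mean squared difference between the two Gram matrices."""
    g_gen, g_style = gram_matrix(generated), gram_matrix(style)
    c = len(g_gen)
    total = sum(
        (g_gen[i][j] - g_style[i][j]) ** 2 for i in range(c) for j in range(c)
    )
    return total / (c * c)

# Two channels over four spatial positions.
features = [[1, 2, 3, 4], [0, 1, 0, 1]]
# The same activations with the spatial positions reversed.
rearranged = [[4, 3, 2, 1], [1, 0, 1, 0]]

print(gram_matrix(features))
print(style_loss(features, rearranged))
```

Rearranging the spatial positions leaves the Gram matrix unchanged, so the style loss between the two is zero even though a content loss on the same activations would not be.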
Chapter 10. Visual embeddings use CNNs to generate an embedding instead of making a direct prediction, which is useful for various matching and retrieval tasks. The most straightforward way to train an embedding is to train a classifier and take the activations of the last layer before the classification output; however, this usually performs worse than architectures that incorporate similarity. Contrastive loss pulls examples of the same class closer together and pushes examples of different classes further apart: a pair from different classes is penalized only if it is closer than a fixed margin. However, the margin remains the same for all classes, even though some classes are more different than others.
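The contrastive loss described above can be written for a single pair of embeddings. This is a sketch of one common formulation (squared distance for matching pairs, squared hinge on the margin for non-matching pairs); the margin value and the example embeddings are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Loss for one pair of embeddings."""
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2                    # pull matching pairs together
    return max(0.0, margin - d) ** 2     # push others apart, up to the margin

anchor = (0.0, 0.0)
near = (0.5, 0.0)
far = (2.0, 0.0)

print(contrastive_loss(anchor, near, same_class=True))    # 0.25
print(contrastive_loss(anchor, near, same_class=False))   # 0.25
print(contrastive_loss(anchor, far, same_class=False))    # 0.0
```

A non-matching pair that is already further apart than the margin contributes nothing, so training effort goes to the pairs that are still confusable.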
Triplet loss, proposed by FaceNet, works by sampling three items at a time: an anchor point, a positive example, and a negative example. The loss function rewards scenarios where the positive example is closer to the anchor point than the negative example. However, selecting the examples presents a challenge. If you sample randomly in the data loader, it might not provide enough diversity to generate these triplets; therefore, the strategy is to sample classes first and then sample images from each class.
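Both ideas in this paragraph, the triplet loss and class-first sampling, fit in a short sketch. The loss below uses squared Euclidean distance with a margin, in the style of FaceNet; the margin value, the sampler name, and the toy data are assumptions for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the negative is further than the positive by the margin."""
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)

def sample_batch(images_by_class, num_classes, images_per_class, rng):
    """Sample classes first, then images per class, so that every
    anchor in the batch has positives and negatives available."""
    classes = rng.sample(sorted(images_by_class), num_classes)
    batch = []
    for label in classes:
        for image in rng.sample(images_by_class[label], images_per_class):
            batch.append((image, label))
    return batch

easy = triplet_loss((0.0, 0.0), (1.0, 0.0), (3.0, 0.0))   # negative far away
hard = triplet_loss((0.0, 0.0), (1.0, 0.0), (0.5, 0.0))   # negative too close

# Hypothetical dataset: 5 classes with 10 image ids each.
dataset = {f"class_{c}": [f"img_{c}_{i}" for i in range(10)] for c in range(5)}
batch = sample_batch(dataset, num_classes=3, images_per_class=4, rng=random.Random(0))
print(easy, hard, len(batch))
```

With purely random sampling from a dataset with many classes, a batch could easily contain only one image per class, leaving no positive example for any anchor.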
The next step is determining how to generate effective triplets. Many triplets will be uninformative because they are too easy; the model has already learned to distinguish them. To address this, all possible positive and negative samples are ranked as easy or hard based on their distance from the anchor. One strategy is “batch hard,” which involves selecting the hardest points to train on in each iteration. However, this approach can be drastically affected by outliers, which will always be selected. To mitigate this, FaceNet uses a “semi-hard” selection strategy that selects points that are challenging but not the most difficult. Additionally, there are various ways to randomly sample with weights according to their level of difficulty.
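The difference between the selection strategies is easiest to see on distances alone. The sketch below follows the usual definitions: a negative is hard if it is closer to the anchor than the positive, semi-hard if it is further than the positive but still inside the margin, and easy otherwise. The function names, margin, and distances are made up for illustration.

```python
def difficulty(d_ap, d_an, margin):
    """Classify a negative by its distance to the anchor."""
    if d_an < d_ap:
        return "hard"        # negative is closer than the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # correct order, but inside the margin
    return "easy"            # loss is already zero, nothing to learn

def batch_hard_negative(d_an_list):
    """Index of the closest (hardest) negative."""
    return min(range(len(d_an_list)), key=lambda i: d_an_list[i])

def semi_hard_negatives(d_ap, d_an_list, margin):
    """Indices of all negatives that are challenging but not the hardest kind."""
    return [
        i for i, d_an in enumerate(d_an_list)
        if difficulty(d_ap, d_an, margin) == "semi-hard"
    ]

d_ap = 1.0                           # anchor-to-positive distance
negatives = [0.2, 1.2, 1.4, 3.0]     # anchor-to-negative distances
margin = 0.5

labels = [difficulty(d_ap, d, margin) for d in negatives]
hardest = batch_hard_negative(negatives)
semi_hard = semi_hard_negatives(d_ap, negatives, margin)
print(labels, hardest, semi_hard)
```

If the negative at distance 0.2 is a mislabeled image or an outlier, batch hard would pick it every time, whereas semi-hard selection skips it in favor of the two negatives just inside the margin.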
Two image retrieval datasets are covered: the DeepFashion dataset, which contains images of fashion items from different perspectives that belong to the same class, and the VeRi dataset, which includes images of vehicles for re-identification purposes.