Convolutional neural networks

Convolutional neural networks – CNNs or convnets for short – are at the heart of deep learning, emerging in recent years as the most prominent strain of neural networks. They have revolutionized computer vision, achieving state-of-the-art results in fundamental tasks, and have been widely deployed by tech companies for many of the new services and features we see today. They have numerous and diverse applications, some of which are surveyed later in this chapter.

Although convnets have been around since the 1980s and have their roots in earlier neuroscience research, they have only recently achieved fame in the wider scientific community for a series of remarkable successes in important scientific problems across multiple domains. They extend neural networks primarily by introducing a new kind of layer, designed to improve the network’s ability to cope with variations in position, scale, and viewpoint. Additionally, they have become increasingly deep, containing upwards of dozens or even hundreds of layers, forming detailed compositional models of images, sounds, as well as game boards and other spatial data structures.

Because of their success at vision-oriented tasks, they have been adopted by interactive and new media artists, allowing their installations not only to detect movement, but to actively identify, describe, and track objects in physical spaces.

The next few chapters will focus on convnets and their applications, with this one formulating them and how they are trained, the next one describing their properties, and subsequent chapters focusing on their creative and artistic applications.

Weaknesses of ordinary neural nets

To understand the innovations convnets offer, it helps to first review the weaknesses of ordinary neural networks, which are covered in more detail in the prior chapter Looking inside neural nets.

Recall that in a trained one-layer ordinary neural network, the weights between the input pixels and the output neurons end up looking like templates for each output class. This is because they are constrained to capture all the information about each class in a single layer. Each of these templates looks like an average of samples belonging to that class.

Weights review

In the case of the MNIST dataset, we see that the templates are relatively discernible and thus effective, but for CIFAR-10, they are much more difficult to recognize. The reason is that the image categories in CIFAR-10 have a great deal more internal variation than those in MNIST. Images of dogs may show dogs that are curled up or outstretched, have different fur colors, or are cluttered with other objects, among various other distortions. Forced to learn all of these variations in one layer, our network simply forms a weak template of dogs, and fails to accurately recognize unseen ones.

We can combat this by creating more hidden layers, giving our network the capacity to form a hierarchy of discovered features. For instance, we saw that many of the pictures of horses in CIFAR-10 are of left-facing and right-facing horses, making the above template resemble a two-headed horse. If we create a hidden layer, our network could potentially learn a “right-facing horse” and a “left-facing horse” template in the hidden layer, and the output neuron for horse could have strong weights to each of them.

todo: multilayer weights

This makes some intuitive sense and gives our network more flexibility, but it’s impractical for the network to be able to memorize the nearly endless set of permutations which would fully characterize a dataset of images. In order to capture this much information, we’d need far too many neurons for what we can practically afford to store or train. The advantage of convnets is that they will allow us to capture these permutations in a more efficient way.


How can we encode variations among many classes of images efficiently? We can get some intuition to this question by considering an example.

Suppose I show you a picture of a car that you’ve never seen before. Chances are you’ll be able to identify it as a car by observing that it is a permutation of the various properties of cars. In other words, the picture contains some combination of the parts that make up most cars, including a windshield, wheels, doors, and exhaust pipe. By recognizing each of the smaller parts and adding them up, you realize that this picture is of a car, despite having never encountered this precise combination of those parts.

A convnet tries to do something similar: learn the individual parts of objects and store them in individual neurons, then add them up to recognize the larger object. This approach is advantageous for two reasons. One is that we can capture a greater variety of a particular object within a smaller number of neurons. For example, suppose we memorize 10 templates for different types of wheels, 10 templates for doors, and 10 for windshields. We thus capture $10 * 10 * 10 = 1000$ different cars for the price of only 30 templates. This is much more efficient than keeping around 1000 separate templates for cars, which contain much redundancy within them. But even better, we can reuse the smaller templates for different object classes. Wagons also have wheels. Houses also have doors. Ships also have windshields. We can construct a set of many more object classes as various combinations of these smaller parts, and do so very efficiently.
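
The arithmetic behind this saving can be checked with a small sketch. The part names below are hypothetical placeholders standing in for learned templates; nothing here is an actual convnet, only the counting argument.

```python
from itertools import product

# Hypothetical part templates; the names are placeholders, not learned filters.
wheels = [f"wheel_{i}" for i in range(10)]
doors = [f"door_{i}" for i in range(10)]
windshields = [f"windshield_{i}" for i in range(10)]

# We only store the parts themselves...
templates_stored = len(wheels) + len(doors) + len(windshields)

# ...but every combination of one wheel, one door, and one windshield is a distinct "car".
cars = list(product(wheels, doors, windshields))

print(templates_stored)  # 30
print(len(cars))         # 1000
```

The number of stored templates grows as a sum while the number of describable objects grows as a product, which is where the efficiency comes from.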

Antecedents and history of convnets

Before formally showing how convnets detect these kinds of features, let’s take a look at some of the important antecedents to them, to understand the evolution of our methods for combating the problems we described earlier.

Experiments of Hubel & Wiesel (1960s)

During the 1960s, neurophysiologists David Hubel and Torsten Wiesel conducted a series of experiments to investigate the properties of the visual cortices of animals. In one of the most notable experiments, they measured the electrical responses from a cat’s brain while stimulating it with simple patterns on a television screen. What they found was that neurons in the early visual cortex are organized in a hierarchical fashion, where the first cells connected to the cat’s retinas are responsible for detecting simple patterns like edges and bars, followed by later layers responding to more complex patterns by combining the earlier neuronal activities.

Hubel + Wiesel

Later experiments on macaque monkeys revealed similar structures, and continued to refine an emerging understanding of mammalian visual processing. Their experiments would provide an early inspiration to artificial intelligence researchers seeking to construct well-defined computational frameworks for computer vision.

Hubel & Wiesel: Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex

Fukushima’s Neocognitron (1982)

Hubel and Wiesel’s experiments were directly cited as inspiration by Kunihiko Fukushima in devising the Neocognitron, a neural network which attempted to mimic these hierarchical and compositional properties of the visual cortex. The neocognitron was the first neural network architecture to use hierarchical layers where each layer is responsible for detecting a pattern from the output of the previous one, using a sliding filter to locate it anywhere in the image.


Although the neocognitron achieved some success in pattern recognition tasks, it was limited by the lack of a training algorithm to learn the filters. This meant that the pattern detectors were manually engineered for the specific task, using a variety of heuristics and techniques from computer vision. At the time, backpropagation had not yet been applied to train neural nets, and thus there was no easy way to optimize neocognitrons or reuse them on different vision tasks.

LeNet (1998)

During the 1990s, a team at AT&T Labs led by Yann LeCun trained a convolutional network, nicknamed “LeNet”, to classify images of handwritten digits to an accuracy of 99.3%. Their system was used for a time to automatically read the numbers in 10-20% of checks printed in the US. LeNet had 7 layers, including two convolutional layers, with the architecture summarized in the below figure.


Their system was the first convolutional network to be applied to an industrial-scale application. Despite this triumph, many computer scientists believed that neural networks would be incapable of scaling up to recognition tasks involving more classes, higher resolution, or more complex content. For this reason, computer vision would continue to be mostly carried out by other algorithms for more than another decade.

AlexNet (2012)

Convolutional networks began to take over computer vision – and by extension, machine learning more generally – in the early 2010s. In 2009, researchers at the computer science department at Princeton University, led by Fei-Fei Li, compiled the ImageNet database, a large-scale dataset containing over 14 million labeled images which were manually annotated into thousands of categories using Mechanical Turk. ImageNet was by far the largest such dataset ever released and quickly became a staple of the research community. A year later, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was launched as an annual competition for computer vision researchers working with a 1000-class subset of ImageNet. The ILSVRC welcomed researchers to compete on a number of important benchmarks, including classification, localization, detection, and others – tasks which will be described in more detail later in this chapter.

The Mechanical Turk backend used to provide labels for ImageNet. Source: Dave Gershgorn

For the first two years of the competition, the winning entries all used what were then standard approaches to computer vision, and did not involve the use of convolutional networks. The winning entries in the classification task had a top-5 error (the fraction of images for which the correct class is not among the top five predictions) between 25 and 28%. In 2012, a team from the University of Toronto led by Geoffrey Hinton, Ilya Sutskever, and Alex Krizhevsky submitted a deep convolutional neural network nicknamed “AlexNet” which won the competition by a dramatic margin. AlexNet cut the top-5 classification error from the previous record of 26% down to 15%, a relative reduction of over 40%.


Starting the following year, nearly all entries to ILSVRC were deep convolutional networks, and classification error has steadily tumbled down to nearly 2% in 2017, the last year of ILSVRC. Convnets now even outperform humans in ImageNet classification. These monumental results have largely fueled the excitement about deep learning that would follow, and many consider them to have revolutionized computer vision as a field. Furthermore, many important research breakthroughs that are now common in network architectures – such as residual layers – were introduced as entries to ILSVRC.

todo: ImageNet timeline

How convnets work

Despite having their own proper name, convnets are not categorically different from the neural networks we have seen so far. In fact, they inherit all of the functionality of those earlier nets, and improve them mainly by introducing a new type of layer, namely a convolutional layer, along with a number of other innovations emulating and refining the ideas introduced by neocognitron. Thus any neural network which contains at least one convolutional layer can be regarded as a convnet.

Convolutional layers

Prior to this chapter, we’ve just looked at fully-connected layers, in which each neuron is connected to every neuron in the previous layer. Convolutional layers break this assumption. They are actually mathematically very similar to fully-connected layers, differing only in how the neurons are connected. Let’s first recall that in a fully-connected layer, we compute the value of a neuron $z$ as a weighted sum of all the previous layer’s neurons, $z = b + \sum_i w_i x_i$.
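
As a reminder of what that weighted sum looks like in practice, here is a minimal sketch in plain Python. The function name `neuron_output` and the toy values are our own assumptions for illustration; activation functions are left out.

```python
def neuron_output(weights, inputs, bias):
    """Weighted sum z = b + sum_i w_i * x_i for a single neuron."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return bias + sum(w * x for w, x in zip(weights, inputs))

# Hypothetical toy values: a 4-pixel "image" feeding one output neuron.
x = [0.0, 1.0, 1.0, 0.0]
w = [0.5, -1.0, 2.0, 0.25]
b = 0.5

z = neuron_output(w, x, b)
print(z)  # 1.5
```

Every input pixel has its own weight here, which is exactly the property that convolutional layers give up.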

Weights analogy

We can interpret the set of weights as a feature detector which is trying to detect the presence of a particular feature. We can visualize these feature detectors, as we did previously for MNIST and CIFAR. In a one-layer fully-connected network, the “features” are simply the image classes themselves, and thus the weights appear as templates for the entire classes.

In convolutional layers, we instead have a collection of smaller feature detectors – called convolutional filters – which we individually slide along the entire image, performing the same weighted sum operation as before on each subregion of the image. Essentially, for each of these small filters, we generate a map of responses – called an activation map – which indicates the presence of that feature across the image.
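
The sliding operation can be sketched in a few lines of plain Python. This is an illustrative version only, assuming a stride of 1 and no padding; the function name `convolve2d`, the toy image, and the edge filter are all hypothetical. Like most deep learning code, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
def convolve2d(image, kernel, bias=0.0):
    """Slide `kernel` over `image` (stride 1, no padding) and return the activation map."""
    img_h, img_w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = img_h - k_h + 1, img_w - k_w + 1
    activation_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Same weighted sum as a fully-connected neuron, restricted to one patch.
            total = bias
            for i in range(k_h):
                for j in range(k_w):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        activation_map.append(row)
    return activation_map

# Hypothetical 4x4 image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hypothetical 2x2 filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

activation = convolve2d(image, vertical_edge)
for row in activation:
    print(row)  # each row is [0.0, 2.0, 0.0]
```

The filter responds only in the middle column of the activation map, which is where the edge sits in the image. The same four weights are reused at every position.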

The process of convolving the image with a single filter is given by the following demo.

todo: rebuild mouse demo, button for changing filter/weight, click on filters

In the above demo, we are showing a single convolutional layer on an MNIST digit. In this particular network at this layer, we have exactly 8 filters, and below we show each of the corresponding 8 activation maps.

Each of the pixels of these activation maps can be thought of as a single neuron in the next layer of the network. Thus in our example, since we have 8 filters generating $24 * 24$ sized maps, we have $8 * 24 * 24 = 4608$ neurons in the next layer. Each neuron signifies the amount of presence of a feature at a particular xy-point. It’s worth emphasizing the differences in our visualization above to what we have seen before; in prior chapters, we always viewed the neurons (activations) of ordinary neural nets as one long column, whereas now we are viewing them as a set of activation maps. Although we could also unroll these if we wish, it helps to continue to visualize them this way because it gives us some visual understanding of what’s going on. We will refine this point in a later section.

Convolutional layers have a few properties, or hyperparameters, which must be set in advance. They include the size of the filters ($5 \times 5$ in the above example), the stride and spatial arrangement, and padding. A full explanation of these is beyond the scope of the chapter, but a good overview of these can be found here.

Pooling layers

Before we explain the significance of the convolutional layers, let’s also quickly introduce pooling layers, another (much simpler) kind of layer, which are very commonly found in convnets, often directly after convolutional layers. These were originally called “subsampling” layers by LeCun et al, but are now generally referred to as pooling.

The pooling operation is used to downsample the activation maps, usually by a factor of 2 in both dimensions. The most common way of doing this is max pooling which merges the pixels in adjacent 2x2 cells by taking the maximum value among them. The figure below shows an example of this.

Max pooling

The advantage of pooling is that it gives us a way to compactify the amount of data without losing too much information, and create some invariance to translational shift in the original image. The operation is also very cheap since there are no weights or parameters to learn.
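
A minimal sketch of 2x2 max pooling in plain Python follows. The function name `max_pool_2x2` and the example values are our own; the sketch assumes non-overlapping cells and simply drops a trailing row or column if a dimension is odd.

```python
def max_pool_2x2(activation_map):
    """Downsample by 2 in each dimension, keeping the maximum of each 2x2 cell."""
    height, width = len(activation_map), len(activation_map[0])
    pooled = []
    for top in range(0, height - 1, 2):
        row = []
        for left in range(0, width - 1, 2):
            cell = (
                activation_map[top][left],
                activation_map[top][left + 1],
                activation_map[top + 1][left],
                activation_map[top + 1][left + 1],
            )
            row.append(max(cell))
        pooled.append(row)
    return pooled

# Hypothetical 4x4 activation map.
example_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]

pooled = max_pool_2x2(example_map)
print(pooled)  # [[4, 2], [2, 8]]
```

Note that nothing in the function is learned: there are no weights, only a fixed rule applied to each cell.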

Recently, pooling layers have begun to gradually fall out of favor. Some architectures have incorporated the downsampling operation into the convolutional layers themselves by using a stride of 2 instead of 1, making the convolutional filters skip over pixels and resulting in activation maps half the size. These “all-convolutional nets” have some important advantages and are becoming increasingly common, but have not yet totally eliminated pooling.


Let’s zoom out from what we just looked at and see the bigger picture. From this point onward, it helps to interpret the data flowing through a convnet as a volume. In previous chapters, our visualizations of neural networks always “unrolled” the pixels into a long column of neurons. But to visualize convnets properly, it makes more sense to continue to arrange the neurons in accordance to their actual layout in the image, as we saw in the last demo with the eight activation maps.

In this sense, we can think of the original image as a volume of data. Let’s consider the previous example. Our original image is 28 x 28 pixels and is grayscale (1 channel). Thus it is a volume whose dimensions are 28x28x1. In the first convolutional layer, we convolved it with 8 filters whose dimensions are 5x5x1. This gave us 8 activation maps of size 24x24. Thus the output from the convolutional layer is size 24x24x8. After max-pooling it, it’s 12x12x8.
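
To make the bookkeeping concrete, here is a small sketch that tracks only the shapes of the volumes, not their contents. The helper names are hypothetical, and the sketch assumes square filters, a stride of 1, no padding, and non-overlapping 2x2 pooling, as in the example above.

```python
def conv_output_shape(input_shape, num_filters, filter_size):
    """Shape after a stride-1, unpadded convolution with square filters."""
    height, width, _channels = input_shape
    return (height - filter_size + 1, width - filter_size + 1, num_filters)

def pool_output_shape(input_shape, factor=2):
    """Shape after non-overlapping pooling by `factor` in x and y."""
    height, width, channels = input_shape
    return (height // factor, width // factor, channels)

image_shape = (28, 28, 1)  # grayscale MNIST digit
after_conv = conv_output_shape(image_shape, num_filters=8, filter_size=5)
after_pool = pool_output_shape(after_conv)

print(after_conv)  # (24, 24, 8)
print(after_pool)  # (12, 12, 8)
```

The depth of the output depends only on the number of filters, while the x and y sizes depend only on the input size and the filter size.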

What happens if the original image is color? In this case, our analogy scales very simply. Our convolutional filters would then also be color, and therefore have 3 channels. The convolution operation would work exactly as it did before, but simply have three times as many multiplications to make; the multiplications continue to line up by x and y as before, but also now by channel. So suppose we were using CIFAR-10 color images, whose size is 32x32x3, and we put one through a convolutional layer consisting of 20 filters of size 7x7x3. Then the output would be a volume of 26x26x20. The size in the x and y dimensions is 26 because there are 26x26 possible positions at which to place a 7x7 filter inside of a 32x32 image, and its depth is 20 because there are 20 filters.
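
These sizes follow a general rule, stated here as a sketch using standard convolution arithmetic. The symbol names are our own: assume a square input of width $W_{in}$, a square filter of width $F$, zero padding of $P$ pixels on each side, and a stride of $S$. The output width is then

$$W_{out} = \left\lfloor \frac{W_{in} - F + 2P}{S} \right\rfloor + 1$$

With $W_{in} = 32$, $F = 7$, $P = 0$, and $S = 1$, this gives $32 - 7 + 1 = 26$, matching the CIFAR-10 example. With $W_{in} = 28$ and $F = 5$ it gives 24, matching the MNIST example. Setting $S = 2$ in the MNIST case gives $\lfloor 23 / 2 \rfloor + 1 = 12$, the same size we obtained by pooling the 24x24 maps. The depth of the output volume is not covered by this formula; it simply equals the number of filters.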

todo: formula for xy size of volumes


We can think of the stacked activation maps as a sort-of “image.” It’s no longer really an image of course because there are 20 channels instead of just 3. But it’s worth seeing the equivalent representations; the input image is a volume of size 32x32x3, and the output from the first convolutional layer is a volume of size 26x26x20. Seeing the equivalence of these forms is crucial because it will help us understand the gist of the next section.

Things get deep

Ok, here’s where things are going to get really tricky! The whole chapter has been leading up to this section; we are going to design a full convolutional neural network to classify MNIST handwritten digits which will contain three convolutional layers and three pooling layers, followed by two fully connected layers. We are going to visualize the activations at each step of the way, and try to interpret what’s going on.

todo: convnet w/ three convs visualization

What do multiple convolutional layers give us?

In the first convolutional layer, we deployed 8 feature detectors to find small multi-pixel patterns in the original image, giving us a volume of information corresponding to the presence of those features inside the image, which was subsequently pooled into a 12x12x8 resulting volume. Then we did another convolution on that volume. What is this second convolution achieving? Recall that the first conv is detecting patterns in the pixels of the original input image. It follows that the second conv is detecting patterns in the “pixels” of the volume resulting from the first conv (and pool). But those “pixels” aren’t actually the original image pixels; rather, they signify the presence of the first layer features. Therefore, the second conv is detecting patterns among the features found

Improving CIFAR-10 accuracy

Applications of convnets

Since the early 2010s, convnets have ascended to become the most widely used deep learning algorithms for a variety of applications. Once considered successful only for a handful of specific computer vision tasks, they are now also deployed for audio, text, and other types of data processing. They owe their versatility to the automation of feature extraction, something which was once the most time-consuming and costly process necessary for applying a learning algorithm to a new task. By incorporating feature extraction itself into the training process, it’s now possible to reappropriate a convnet’s architecture, often with very few changes, for a totally different task or even domain, and retrain it.

Although a full review of these is out of the scope of this chapter, this section will highlight a number of them.

In computer vision

Besides image classification, convnets can be trained to perform a number of tasks which give us more granular information about images. One task closely associated with classification is that of localization: assigning a bounding rectangle to the primary subject of the classification. This task is typically posed as a regression alongside the classification, where the network must accurately predict the coordinates of the box (x, y, width, and height).

This task can be extended more ambitiously to the task of object detection. In object detection, rather than classifying and localizing a single subject in the image, we allow for the presence of multiple objects to be located and classified within the image. The below image summarizes these three associated tasks.

Classification, localization, and detection are the building blocks of more sophisticated computer vision systems. Source: Stanford CS231n

Object detection has only recently become possible. One of the major limitations holding it back – besides its increased complexity compared to single-class classification – was a lack of available data. Even the ImageNet dataset, which was used to take classification to the next level, was unable to do anything about detection because it had no bounding rectangle information. But more recent datasets like MS-COCO have enabled us to pursue localization and detection, as they contain much richer data describing their images (roughly 330,000 in the case of MS-COCO).

Most of the early attempts at training convnets to do multiple object detection consisted of using localization to identify potential bounding boxes, then simply applying classification to all of those boxes and keeping the ones in which the network had the most confidence. This approach, however, is slow because it requires at least one forward pass of the network for each of the dozens or even hundreds of candidates. In certain situations, the slowness is unacceptable. For example, an autonomous vehicle needs to be able to identify roads, pedestrians, and obstacles in real time, and cannot wait for hundreds of forward passes on every frame.

Recently, a powerful framework called YOLO, developed by Joseph Redmon, has been proposed. YOLO – which stands for “you only look once” – restricts the network to only “look” at the image a single time, i.e. it is permitted a single forward pass of the network to obtain all the information it needs, hence the name. It has in some circumstances achieved a speed of 40-90 frames per second on multiple object detection, making it capable of being deployed in real-time scenarios demanding such responsiveness. The approach is to divide the image into a grid of equally-sized regions, and have each one predict a candidate object along with its classification and bounding box. At the end, those regions with the highest confidence are kept. The figure below summarizes this approach.
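
The final filtering step can be illustrated with a toy sketch. This is not the YOLO implementation: the `CellPrediction` structure, the threshold of 0.5, and all the numbers are hypothetical. The published system also involves details omitted here, such as several boxes per grid cell and the suppression of overlapping boxes. The sketch only shows the idea of keeping the most confident regions from a single pass.

```python
from dataclasses import dataclass

@dataclass
class CellPrediction:
    row: int           # grid row of the region
    col: int           # grid column of the region
    label: str         # predicted class
    confidence: float  # how sure the network is, between 0 and 1
    box: tuple         # (x, y, width, height), hypothetical normalized units

def keep_confident(predictions, threshold=0.5):
    """Keep predictions whose confidence reaches the threshold, most confident first."""
    kept = [p for p in predictions if p.confidence >= threshold]
    return sorted(kept, key=lambda p: p.confidence, reverse=True)

# Hypothetical output of one forward pass over a 2x2 grid.
predictions = [
    CellPrediction(0, 0, "dog", 0.91, (0.10, 0.20, 0.30, 0.40)),
    CellPrediction(0, 1, "bicycle", 0.12, (0.60, 0.10, 0.20, 0.20)),
    CellPrediction(1, 0, "car", 0.78, (0.05, 0.55, 0.40, 0.30)),
    CellPrediction(1, 1, "person", 0.33, (0.70, 0.60, 0.10, 0.30)),
]

detections = keep_confident(predictions)
print([d.label for d in detections])  # ['dog', 'car']
```

Because every region's prediction comes out of the same forward pass, the cost does not grow with the number of candidate boxes the way it does in the classify-each-box approach.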

Real-time object detection is possible by training a network to output classifications and localizations for all found objects simultaneously. Source: You Only Look Once: Unified, Real-Time Object Detection (Redmon et al)
Some examples of YOLO detecting objects in image. Source: YOLO9000: Better, Faster, Stronger (Redmon)

Still more tasks relevant to computer vision have been introduced or improved in the last few years, including the closely related image segmentation task, and systems specialized for retrieving text from images. Another class of tasks involves annotating images with natural language by combining convnets with recurrent neural networks. This chapter will leave those to be found in future chapters, or within the included links for further reading.

Perhaps one of the most surprising aspects of convnets is their versatility, and their success in the audio domain. Although most introductions to convnets (like this chapter) emphasize computer vision, convnets have been achieving state-of-the-art results in the audio domain for just as long. Convnets are routinely used for speech recognition and other audio information retrieval work, supplanting older approaches over the last few years as well. Prior to the ascendancy of neural networks in the audio domain, speech-to-text was typically done using a hidden Markov model along with handcrafted audio feature extraction done using conventional digital signal processing.

A more recent use case for convnets in audio is that of WaveNet, introduced by researchers at DeepMind in late 2016. WaveNets are capable of learning how to synthesize audio by training on large amounts of raw audio. WaveNets have been used by Magenta to create custom digital audio synthesizers and are now used to generate the voice of Google Assistant.

Diagram depicting CD-CNN-HMM, an architecture used for speech recognition. The convnet is used to learn features from a waveform's spectral representation. Source: Speech Recognition Wiki
WaveNets are used to create a generative model for probabilistically producing audio one sample at a time. Source: DeepMind

Generative applications of convnets, including in the image domain and associated with computer vision, as well as those that also make use of recurrent neural networks, are also left to future chapters.

further reading: