Deep neural networks and Deep Learning are powerful and popular algorithms. And a lot of their success lays in the careful design of the neural network architecture. This post will revisit the history of neural network design in the last few years and in the context of Deep Learning.
It is the year 1994, and this is one of the very first convolutional neural networks, and what advanced the field of Deep Learning. This pioneering work by Yann LeCun was named LeNet5 after many previous successful iterations since the year 1988!
The LeNet5 architecture was fundamental, in particular the insight that image features are distributed across the entire image, and convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters. At the time there was no GPU to help training, and even CPUs were slow. Therefore being able to save parameters and computation was a key advantage. This is in contrast to using each pixel as a separate input of a large multi-layer neural network which is very popular today. LeNet5 explained that those should not be used in the first layer, because images are highly spatially correlated, and using individual pixel of the image as separate input features would not take advantage of these correlations.
LeNet5 features can be summarized as:
In overall this network was the origin of much of the recent architectures, and a true inspiration for many people in the field.
In the years from 1998 to 2010 neural network were in incubation. Most researchers in AI field did not notice the increasing power, while many other researchers slowly progressed. More and more data was available because of the rise of cell-phone cameras and cheap digital cameras. And computing power was on the rise, CPUs were becoming faster, and GPUs became a general-purpose computing tool. Both of these trends made neural network progress, though at a very slow rate. Both data and computing power made the tasks that neural networks tackled more and more interesting. And then its power and capability became clear to the data scientist community.
Dan Ciresan Net
In 2010 Dan Claudiu Ciresan and Jurgen Schmidhuber published one of the very first implementations of GPU Neural nets. This implementation had both forward and backward implemented on an NVIDIA GTX 280 graphic processor of an up to 9 layers neural network. This opened up the door for AI and Machine learning experts in front of neural network.
In 2012, Alex Krizhevsky released AlexNetwhich was a deeper and much wider version of the LeNet and won by a large margin the difficult ImageNet competition.
AlexNet scaled the insights of LeNet into a much larger neural network that could be used to learn much more complex objects and object hierarchies.
The contribution of this work were:
At the time GPU offered a much larger number of cores than CPUs, and allowed 10x faster training time, which in turn allowed to use larger datasets and also bigger images for training to achieve better accuracy.
The success of AlexNet started a small revolution. Convolutional neural network were now the workhorse of Deep Learning, which became the new name and the standard for “large neural networks that can now solve useful tasks”. Most of deep learning neural networks involved through learning from AlexNet network.
In December 2013 the NYU lab from Yann LeCun came up with Overfeat, which is a derivative of AlexNet. The article also proposed learning bounding boxes, which later gave rise to many other papers on the same topic. This technique is used to identify object location in image.
The VGG networks from Oxford were the first to use much smaller 3×3 filters in each convolutional layers and also combined them as a sequence of convolutions.
This seems to be contrary to the principles of LeNet, where large convolutions were used to capture similar features in an image. Instead of the 9×9 or 11×11 filters of AlexNet, filters started to become smaller, too dangerously close to the infamous 1×1 convolutions that LeNet wanted to avoid, at least on the first layers of the network. But the great advantage of VGG was the insight that multiple 3×3 convolution in sequence can emulate the effect of larger receptive fields, for examples 5×5 and 7×7. These ideas will be also used in more recent network architectures as Inception and ResNet.
The VGG networks uses multiple 3x3 convolutional layers to represent complex features. This obviously amounts to a massive number of parameters, and also learning power. But training of these network was difficult, and had to be split into smaller networks with layers added one by one. All this because of the lack of strong ways to regularize the model, or to somehow restrict the massive search space promoted by the large amount of parameters.
VGG used large feature sizes in many layers and thus inference was quite costly at run-time. Reducing the number of features, as done in Inception bottlenecks, will save some of the computational cost.
Network-in-network (NiN) had the great and simple insight of using 1x1 convolutions to provide more combinational power to the features of a convolutional layers.
The NiN architecture used spatial MLP layers after each convolution, in order to better combine features before another layer. Again one can think the 1x1 convolutions are against the original principles of LeNet, but really they instead help to combine convolutional features in a better way, which is not possible by simply stacking more convolutional layers. This is different from using raw pixels as input to the next layer. Here 1×1 convolution are used to spatially combine features across features maps after convolution, so they effectively use very few parameters, shared across all pixels of these features!
The power of MLP can greatly increase the effectiveness of individual convolutional features by combining them into more complex groups. This idea will be later used in most recent architectures as ResNet and Inception and derivatives.
NiN also used an average pooling layer as part of the last classifier, another practice that will become common. This was done to average the response of the network to multiple are of the input image before classification.
GoogLeNet and Inception
Christian Szegedy from Google begun a quest aimed at reducing the computational burden of deep neural networks, and devised the GoogLeNet the first Inception architecture.
Starting from fall 2014, deep learning models were becoming extremely useful in categorizing the content of images and video frames. Given the usefulness of these techniques, the internet giants like Google were very interested in efficient and large deployments of architectures on their server farms.
Christian thought a lot about ways to reduce the computational burden of deep neural nets while obtaining state-of-art performance (on ImageNet, for example). Or be able to keep the computational cost the same, while offering improved performance.
He and his team came up with the Inception module, which at a first glance is basically the parallel combination of 1×1, 3×3, and 5×5 convolutional filters. But the great insight of the inception module was the use of 1×1 convolutional blocks (NiN) to reduce the number of features before the expensive parallel blocks. This is commonly referred as “bottleneck”. Please see “bottleneck layer” section below for details.
GoogLeNet used a stem without inception modules as initial layers, and an average pooling plus softmax classifier similar to NiN. This classifier is also extremely low number of operations, compared to the ones of AlexNet and VGG. This also contributed to a very efficient network design.
Inspired by NiN, the bottleneck layer of Inception was reducing the number of features, and thus operations, at each layer, so the inference time could be kept low. Before passing data to the expensive convolution modules, the number of features was reduce by 4 times. This savings of computational cost is the main reason of the success of this architecture.
Let’s looked at this network in detail. Imagine you have 256 features coming in, and 256 coming out, and the Inception layer only performs 3x3 convolutions. That is 256x256 x 3x3 convolutions that have to be performed (589,000s multiply-accumulate, or MAC operations). Instead of doing this, they decide to reduce the number of features that will have to be convolved, 64 or 256/4. In this case, they first perform 256 -> 64 1×1 convolutions, then 64 convolution on all Inception branches, and then we use again a 1x1 convolution from 64 -> 256 features back again. The operations are now:
For a total of about 70,000 versus the almost 600,000 had before. Almost 10x less operations!
And although we are doing less operations, we are not losing generality in this layer. In fact the bottleneck layers have been proven to perform at state-of-art on the ImageNet dataset. The secret for the success is that the input features are correlated, and thus redundancy can be removed by combining them appropriately with the 1x1 convolutions. Then, after convolution with a smaller number of features, they can be expanded again into meaningful combination for the next layer.
Inception V3 (and V2)
Christian and his team are very efficient researchers. In February 2015 Batch-normalized Inception was introduced as Inception V2. Batch-normalization computes the mean and standard-deviation of all feature maps at the output of a layer, and normalizes their responses with these values. This corresponds to “whitening” the data, and thus making all the neural maps have responses in the same range, and with zero mean. This helps training as the next layer does not have to learn offsets in the input data, and can focus on how to best combine features.
In December 2015 they released a new version of the Inception modules and the corresponding architecture. This article better explains the original GoogLeNet architecture, giving a lot more detail on the design choices. A list of the original ideas are:
Inception modules can also decrease the size of the data by provide pooling while performing the inception computation. This is basically identical to performing a convolution with strides in parallel with a simple pooling layer:
Inception still uses a pooling layer plus softmax as final classifier.
December 2015, at about the same time as Inception v3 ResNet was introduced. ResNet have a simple idea: feed the output of two successive convolutional layer AND also bypass the input to the next layers! Bypassing after 2 layers is a key intuition, as bypassing a single layer did not give much improvements. By 2 layers can be thought as a small classifier, or a Network-In-Network!
This is also the very first time that a network of > hundred, even 1000 layers was trained. ResNet with a large number of layers started to use a bottleneck layer similar to the Inception bottleneck:
This layer reduces the number of features at each layer by first using a 1x1 convolution with a smaller output (usually 1/4 of the input), and then a 3x3 layer, and then again a 1x1 convolution to a larger number of features. Like in the case of Inception modules, this allows to keep the computation low, still providing rich combination of features.
ResNet uses a fairly simple initial layers at the input (stem): a 7x7 conv layer followed with a pool of 2. Contrast this to more complex and less intuitive stems as in Inception V3, V4.
ResNet also uses a pooling layer plus softmax as final classifier.
Additional insights about the ResNet architecture are appearing every day:
Christian and team introduced V4, the new version of Inception. The Inception module after the stem is rather similar to Inception V3. This time though the solution is less elegant and more complex, but also full of less transparent heuristics. It is hard to understand the choices and it is also hard to justify them.
SqueezeNet has been recently released. It is a re-hash of many concepts from ResNet and Inception, and show that after all, a better design of architecture will deliver small network sizes and parameters without needing complex compression algorithms.
ENet was designed to use the minimum number of resources possible from the start. As such it achieves such a small footprint that both encoder and decoder network together only occupies 0.7 MB with fp16 precision. Even at this small size, ENet is similar or above other pure neural network solutions in accuracy of segmentation.
An Analysis of Modules
A systematic evaluation of CNN modules has been presented. They found out that is advantageous to use:
Other notable architectures
FractalNet uses a recursive architecture, which was not tested on ImageNet, and is a derivative or the more general ResNet.
It is believed that crafting neural network architectures is of paramount importance for the progress of the Deep Learning field. But one could now wonder why researchers have to spend so much time in crafting architectures, and why instead we do not use data to tell us what to use, and how to combine modules. This would be nice, but now it is work in progress. Note also that here we mostly talked about architectures for computer vision. Similarly neural network architectures developed in other areas, and it is interesting to study the evolution of architectures for all other tasks also. How will the design of neural network change in other field and how good their results are will be ongoing work for neural network researching ground.