Intro to Convolutional Neural Networks (Personal Notes)

These are my personal notes from studying convolutional neural networks on my own. The main motivation for this blog is to record some core concepts and design choices of convolutional neural networks. Today’s neural networks share much of the same structure (conv, fc, dropout, lrn, etc.), and many of these design choices have been tried and tested by generations of researchers. I was very curious about these choices, so when I first implemented AlexNet, I stopped at the second line and decided to figure out the reasons behind them.

Incentive

Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs.
Two basic cell types have been identified in the visual cortex: simple cells respond maximally to specific edge-like patterns within their receptive field, while complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.

Sparse Connectivity

The inputs of hidden units in layer m come from a subset of units in layer m-1: units that have spatially contiguous receptive fields.

Local Receptive Field

It’ll help to think of the inputs as a 28×28 square of neurons, whose values correspond to the 28×28 pixel intensities we’re using as inputs.
We’ll connect the input pixels to a layer of hidden neurons. But we won’t connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.
Each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a 5×5 region.
If we have a 28×28 input image and 5×5 local receptive fields, then there will be 24×24 neurons in the hidden layer, because a 5×5 window moved one pixel at a time fits in 28 − 5 + 1 = 24 positions along each axis.
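
The 24×24 figure can be checked with a small helper. This is only a sketch: the function name and the `stride` parameter are my own additions, and it assumes no padding, which is what the 28×28 example implies.

```python
def conv_output_size(input_size: int, field_size: int, stride: int = 1) -> int:
    """Number of positions a field_size-wide window can take along one axis (no padding)."""
    return (input_size - field_size) // stride + 1


hidden_side = conv_output_size(28, 5)
print(hidden_side, "x", hidden_side)  # 24 x 24
```

With a stride of 2 (again, an assumption not used in the example above) the same input would give a 12×12 hidden layer.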

Shared weights

We’re going to use the same weights and bias for each of the 24×24 hidden neurons.
This means that all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image.
We call the weights defining the feature map the shared weights, and the bias defining the feature map the shared bias.
A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network.
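
To get a feel for the size of the reduction, here is a rough count. The comparison against a fully connected layer with the same number of hidden neurons is my own assumption for illustration, and both helper names are hypothetical.

```python
def shared_params(field_size: int) -> int:
    # One weight per position in the receptive field, plus one shared bias.
    return field_size * field_size + 1


def fully_connected_params(n_inputs: int, n_hidden: int) -> int:
    # Every hidden neuron has its own weight per input, plus its own bias.
    return n_inputs * n_hidden + n_hidden


conv_count = shared_params(5)                         # 26
fc_count = fully_connected_params(28 * 28, 24 * 24)   # 452160
print(conv_count, fc_count)
```

So one feature map with a 5×5 receptive field needs 26 parameters, against roughly 450 thousand for the fully connected layer in this comparison.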

Max Pooling

This is done in part to reduce over-fitting by providing an abstracted form of the representation.
It also reduces the computational cost by shrinking the feature maps, so later layers have fewer parameters to learn, and it provides basic translation invariance to the internal representation.
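
A minimal sketch of max pooling, assuming non-overlapping 2×2 windows and a feature map stored as a plain list of lists (both are my assumptions; real frameworks work on tensors and also allow overlapping windows):

```python
def max_pool_2d(feature_map, size=2):
    """Non-overlapping size x size max pooling over a list-of-lists feature map."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


fmap = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool_2d(fmap)
print(pooled)  # [[4, 2], [2, 8]]
```

The 4×4 map shrinks to 2×2, and moving the 4 anywhere inside its 2×2 window would not change the output, which is the basic translation invariance mentioned above.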

Different Gradient Descent Algorithms

Three variants of gradient descent

Batch gradient descent

Computes the gradient over the whole training dataset for every update
Cons: slow, and the whole dataset may not fit in memory

Stochastic gradient descent (SGD)

Performs a parameter update for each training example and its label
Pros: (because of fluctuation) can jump out of local minima
When the learning rate is slowly decreased, it shows almost the same convergence behavior as batch gradient descent

Mini-batch gradient descent

performs an update for every mini-batch of n training examples
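
The three variants differ only in how many examples go into each update, so one loop can cover all of them. This is a sketch under my own assumptions: a toy one-parameter model y = w * x, a squared-error loss, and made-up data with true w = 3.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error for the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    data = list(data)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # One parameter update per batch of `batch_size` examples.
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w -= lr * gradient(w, batch)
    return w


data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]
w_batch = train(data, batch_size=len(data))  # batch gradient descent
w_sgd = train(data, batch_size=1)            # stochastic gradient descent
w_mini = train(data, batch_size=2)           # mini-batch gradient descent
print(w_batch, w_sgd, w_mini)
```

On this noise-free toy problem all three settings should end up very close to w = 3; the differences between them only show up on larger, noisier datasets.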

However, there are some challenges:

Learning rate is difficult to choose
A learning rate schedule is often needed; for example, decreasing the learning rate over time can help convergence
Different features might need different learning rates, because some might be sparse and some might be dense
It is hard to avoid being trapped in suboptimal local minima
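
As one example of a learning rate schedule, here is a sketch of step decay. The schedule itself, the function name and the default values are all my own choices for illustration, not something prescribed by the reference.

```python
def step_decay(initial_lr: float, epoch: int, drop: float = 0.5,
               epochs_per_drop: int = 10) -> float:
    """Learning rate at `epoch`, multiplied by `drop` every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)


schedule = [step_decay(0.1, epoch) for epoch in (0, 9, 10, 25)]
print(schedule)
```

Starting from 0.1, the rate stays put for the first ten epochs, halves at epoch 10, and halves again at epoch 20.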

Optimized Solutions

Momentum

Momentum is a method that helps accelerate SGD in the relevant direction
This can help to overcome the problem of sparse feature updates
But it might overshoot and fluctuate in and out of minima
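
A minimal sketch of the momentum update, assuming the usual form v = gamma * v + lr * gradient followed by theta = theta - v. The toy objective f(theta) = theta^2 and the hyperparameter values are my own choices.

```python
def momentum_descent(grad_fn, theta, lr=0.1, gamma=0.9, steps=300):
    velocity = 0.0
    for _ in range(steps):
        # Keep a fraction gamma of the previous update and add the new gradient step.
        velocity = gamma * velocity + lr * grad_fn(theta)
        theta -= velocity
    return theta


# Gradient of f(theta) = theta ** 2 is 2 * theta; the minimum is at 0.
theta_momentum = momentum_descent(lambda t: 2 * t, theta=5.0)
print(theta_momentum)
```

Printing theta inside the loop shows it swinging past 0 several times before settling, which is the fluctuation around the minimum noted above.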

Nesterov accelerated gradient

Momentum gives us an idea of where we are going to be next
We can look ahead at this future position, calculate the gradient there, and use that gradient, rather than the gradient at the current position, for the update
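
The look-ahead idea can be sketched as follows, assuming the formulation where the gradient is evaluated at theta - gamma * v. The toy objective and hyperparameters are again my own assumptions.

```python
def nesterov_descent(grad_fn, theta, lr=0.1, gamma=0.9, steps=300):
    velocity = 0.0
    for _ in range(steps):
        # Approximate future position if we only applied the momentum term.
        lookahead = theta - gamma * velocity
        # Use the gradient at the look-ahead point instead of at theta.
        velocity = gamma * velocity + lr * grad_fn(lookahead)
        theta -= velocity
    return theta


# Gradient of f(theta) = theta ** 2 is 2 * theta; the minimum is at 0.
theta_nag = nesterov_descent(lambda t: 2 * t, theta=5.0)
print(theta_nag)
```

The only difference from plain momentum is where the gradient is evaluated.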

Adagrad

Adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones.
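
A small sketch of why this happens, assuming the standard Adagrad rule where each parameter's step is divided by the square root of its accumulated squared gradients. The two-parameter setup, with one gradient arriving every step and the other only every tenth step, is made up for illustration.

```python
import math


def adagrad_step(theta, grads, accum, lr=0.1, eps=1e-8):
    """Update theta and accum in place, one learning rate per parameter."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        theta[i] -= lr / math.sqrt(accum[i] + eps) * g


theta = [0.0, 0.0]
accum = [0.0, 0.0]
for step in range(100):
    frequent_grad = 1.0                           # non-zero on every step
    rare_grad = 1.0 if step % 10 == 0 else 0.0    # non-zero on every 10th step
    adagrad_step(theta, [frequent_grad, rare_grad], accum)

effective_lr = [0.1 / math.sqrt(a + 1e-8) for a in accum]
print(accum, effective_lr)
```

After 100 steps the rarely updated parameter has accumulated far less squared gradient, so its effective learning rate is about three times larger than that of the frequently updated one.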

Adadelta

Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.
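
In practice the window is usually implemented as an exponentially decaying average rather than by storing w past gradients. The sketch below follows that formulation as I understand it from the reference; the decay rate, epsilon and the toy objective f(theta) = theta^2 are my assumptions.

```python
import math


def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, rho=0.9, eps=1e-6):
    # Decaying average of squared gradients (replaces Adagrad's ever-growing sum).
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad * grad
    # Step size is the ratio of the RMS of past updates to the RMS of gradients.
    delta = -math.sqrt(avg_sq_delta + eps) / math.sqrt(avg_sq_grad + eps) * grad
    # Decaying average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
    return theta + delta, avg_sq_grad, avg_sq_delta


theta_ada, avg_sq_grad, avg_sq_delta = 1.0, 0.0, 0.0
for _ in range(50):
    grad = 2 * theta_ada  # gradient of theta ** 2
    theta_ada, avg_sq_grad, avg_sq_delta = adadelta_step(
        theta_ada, grad, avg_sq_grad, avg_sq_delta)
print(theta_ada, avg_sq_grad)
```

Because the accumulator is an average, it stays bounded by the largest squared gradient seen so far, so the step size does not shrink toward zero the way it does with Adagrad.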

Adam

Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients v_t like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients m_t, similar to momentum.
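
The two running averages can be sketched in code as below. The bias-correction step is part of the usual Adam formulation but is not covered in these notes, and the learning rate of 0.1 and the toy objective f(theta) = theta^2 are my own choices to keep the example short.

```python
import math


def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # decaying average of gradients (m_t)
    v = beta2 * v + (1 - beta2) * grad * grad   # decaying average of squared gradients (v_t)
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def adam_minimise(grad_fn, theta, steps=500):
    m = v = 0.0
    for t in range(1, steps + 1):
        theta, m, v = adam_step(theta, grad_fn(theta), m, v, t)
    return theta


theta_adam = adam_minimise(lambda th: 2 * th, theta=1.0)
print(theta_adam)
```

On the very first step the bias-corrected ratio m_hat / sqrt(v_hat) is close to 1 in magnitude, so the parameter moves by roughly lr regardless of how large the gradient is.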

Reference:

http://sebastianruder.com/optimizing-gradient-descent/

Dropout

Incentive:

Accomplish the effect of ‘combining the predictions of many different models’, which can reduce test error
But this normally requires training many neural networks, or training one neural network in many different ways, which is expensive

Method:

Set the output of each hidden neuron to zero with probability 0.5
The neurons that are dropped in this way do not participate in the forward pass or in back-propagation
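
A minimal sketch of the forward pass with dropout. The test-time branch, which keeps every neuron and multiplies the outputs by 0.5, follows the AlexNet paper rather than these notes; the function name and the list-based activations are my own assumptions.

```python
import random


def dropout(activations, p_drop=0.5, training=True, rng=None):
    if not training:
        # Test time: use all neurons, scaled by the probability of being kept.
        return [a * (1.0 - p_drop) for a in activations]
    rng = rng or random.Random()
    # Training time: each output is zeroed independently with probability p_drop.
    return [0.0 if rng.random() < p_drop else a for a in activations]


hidden = [0.2, 1.5, 0.7, 0.9, 0.3, 1.1]
train_out = dropout(hidden, rng=random.Random(0))
test_out = dropout(hidden, training=False)
print(train_out)
print(test_out)
```

Each training call draws a fresh mask, so a different subset of neurons is active every time, while the underlying weights stay shared.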

Effect

The effect is like sampling a different architecture each time an input is presented, but all these architectures share the same weights

Advantage

Because a neuron cannot rely on the presence of particular other neurons, dropout reduces co-adaptation of neurons
It forces neurons to learn more robust features