This is my personal note when I study convolutional neural network by myself. The main motivation for this blog is to record some core concept and design choice of convolutional neural work. Although today’s neural network has much similar structure (conv, fc, dropout, lrn ..etc.). Much of these design choices are tried and tested by many generations of researchers. I was strongly wondering about these choice, so when I first implement the AlexNet, I stoped in the second line and decided to figure out reasons for these choices.


  • Convolutional Neural Networks (CNN) are biologically-inspired variants of MLPs.
  • Two basic cell types have been identified: Simple cells respond maximally to specific edge-like patterns within their receptive field. Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern.

    Sparse Connectivity

  • the inputs of hidden units in layer m are from a subset of units in layer m-1, units that have spatially contiguous receptive fields.

Local Receptive Field

  • it’ll help to think instead of the inputs as a 28×28 square of neurons, whose values correspond to the 28×2828×28 pixel intensities we’re using as inputs
  • we’ll connect the input pixels to a layer of hidden neurons. But we won’t connect every input pixel to every hidden neuron. Instead, we only make connections in small, localized regions of the input image.
  • each neuron in the first hidden layer will be connected to a small region of the input neurons, say, for example, a 5×5 region
  • if we have a 28×28 input image, and 5×5 local receptive fields, then there will be 24×24 neurons in the hidden layer.

Shared weights

  • we’re going to use the same weights and bias for each of the 24×24 hidden neurons
  • This means that all the neurons in the first hidden layer detect exactly the same feature, just at different locations in the input image
  • We call the weights defining the feature map the shared weights, we call the bias defining the feature map in this way the shared bias
  • A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network.


Max Pooling

  • This is done to in part to help over-fitting by providing an abstracted form of the representation.
  • As well, it reduces the computational cost by reducing the number of parameters to learn and provides basic translation invariance to the internal representation.

Different Gradient Descent Algorithms

Three variants of gradient descent

Batch gradient descent

  • compute the gradient for the whole training dataset
  • Cons: slow, may not be able to fit the whole memory

Stochastic gradient descent (SGD)

  • performs a parameter update for each training example, and label
  • Pros: (because of flutuation) can jump out of local minima
  • Almost same convergence behavior as batch gradient

Mini-batch gradient descent

  • performs an update for every mini-batch of n training examples

However, there are some challenges:

  • Learning rate is difficult to choose
  • There should be a schedule of learning rate, for example, decreasing the learning rate can help to converge
  • Different features might need different learning rate, because some might be sparse, some might be dense
  • Avoid being trapped in local minima

Optimized Solutions


  • Momentum is a method that helps accelerate SGD in the relevant direction
  • This can help to overcome the problem of sparse feature updates
  • But it might flutuate in and out of minima

Nesterov accelerated gradient

  • Since momentum gives us a idea about where we are going to be
  • We can look ahead at this future position and calculate the gradient, and then use this gradient to update current gradient


  • adapts the learning rate to the parameters, performing larger updates for infrequent and smaller updates for frequent parameters.


  • Instead of accumulating all past squared gradients, Adadelta restricts the window of accumulated past gradients to some fixed size w.


  • Adaptive Moment Estimation (Adam) is another method that computes adaptive learning rates for each parameter. In addition to storing an exponentially decaying average of past squared gradients vtvt like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients mtmt, similar to momentum:




  • Accomplish the effect of ‘combining prediction of many different models’, which can reduce test errors
  • But this requires many neural networks or one neural network to be trained in different ways


  • Randomly set 50% of output of each hidden neuron to be zero
  • Thus these neurons do not participated in forward pass and back-propagation


  • This effect is just like each time a different architecture, but all these architectures share same weights


  • Because neurons can not rely on the prescence of particular other neurons, so it reduces co-adaptation of neurons
  • It forces neurons to learn more robust features