This is one of the earliest convolutional neural network papers, and many important concepts originate from it. It walks through the whole process: framing the digit recognition problem, choosing a machine learning method, applying back-propagation, analyzing and motivating the convolutional neural network, and finally constructing the network.

The LeNet5 architecture was fundamental because of several core concepts:

  • image features are distributed across the entire image
  • convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters
  • This is in contrast to using each pixel as a separate input to a large multi-layer neural network. The LeNet5 paper argued that individual pixels should not be used as inputs in the first layer, because images are highly spatially correlated, and using individual pixels of the image as separate input features would not take advantage of these correlations.

Modeling the Learning Problem:

There are several approaches to automatic machine learning, but one of the most successful can be called “numerical” or gradient-based learning. The learning machine computes a function Y^p = F(Z^p, W), where Z^p is the p-th input pattern and W represents the collection of adjustable parameters in the system.
In our case, the pattern recognition problem, the output Y^p is the recognized class label of the pattern, or the scores/probabilities associated with each label.

We then define a loss function E^p = D(D^p, F(W, Z^p)) that measures the discrepancy between the predicted output and the desired output D^p for pattern Z^p.

The learning process can then be stated as finding the value of W that minimizes the average loss over the training set.
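
For reference, the pieces above can be written compactly; the notation (Y^p, Z^p, D^p, E^p) follows the paper, and writing E_train as the average of the per-pattern losses is the paper's definition of the training error:

```latex
Y^p = F(Z^p, W)
\qquad
E^p = D\bigl(D^p, F(W, Z^p)\bigr)
\qquad
E_{\mathrm{train}}(W) = \frac{1}{P}\sum_{p=1}^{P} E^p
\qquad
W^{*} = \arg\min_{W} E_{\mathrm{train}}(W)
```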

Relationship between error rate on test set and training set:

Generally, there is an equation that represents the relationship between these variables:

E_test - E_train = k (h / P)^α

That is, the gap between the expected error rate on the test set (E_test) and the error rate on the training set (E_train) decreases with the number of training samples. P is the number of training samples, h is a measure of the “effective capacity” or complexity of the machine, α is a number between 0.5 and 1.0, and k is a constant. This gap always decreases when the number of training samples increases. Furthermore, as the capacity h increases, E_train decreases. Therefore, when increasing the capacity there is a tradeoff between the decrease of the training error rate and the increase of the gap, with an optimal value of the capacity that achieves the lowest generalization error. Most learning algorithms attempt to minimize the training error rate as well as some estimate of the gap.

Gradient-Based Learning

Since our goal is to minimize the loss function, gradient-based learning draws on the fact that it is generally much easier to minimize a reasonably smooth, continuous function than a discrete (combinatorial) one.

A simple way of updating the parameters W is to adjust them in the direction opposite to the gradient of the loss:

W_k = W_{k-1} - ε ∂E(W)/∂W

where ε is a scalar learning rate.

There are several choices for how to apply this update: the weights can be updated once per training example (stochastic, or online, gradient descent), once per small batch of examples, or once per full pass over the training set (batch gradient descent). A minimal sketch of these variants is given below.
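
To make the update rule concrete, here is a minimal NumPy sketch of the weight update for a linear model with a squared-error loss; the linear model, loss, learning rate, and data sizes are illustrative choices, not the paper's exact setup.

```python
import numpy as np

def gradient(W, Z, D):
    """Gradient of the mean squared-error loss E(W) = 0.5 * mean((Z @ W - D)^2)
    for a simple linear model Y = Z @ W (an illustrative stand-in for F(Z, W))."""
    return Z.T @ (Z @ W - D) / Z.shape[0]

def batch_update(W, Z, D, lr=0.1):
    # One update per full pass over the training set.
    return W - lr * gradient(W, Z, D)

def sgd_epoch(W, Z, D, lr=0.1):
    # One update per training example (the "online"/stochastic variant).
    for p in np.random.permutation(Z.shape[0]):
        W = W - lr * gradient(W, Z[p:p + 1], D[p:p + 1])
    return W

# Usage: fit W so that Z @ W approximates the desired outputs D.
rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))        # 200 training patterns Z^p, 5 features each
D = Z @ rng.normal(size=(5, 1))      # desired outputs D^p
W = np.zeros((5, 1))
for _ in range(50):                  # 50 passes over the training data
    W = sgd_epoch(W, Z, D)
print("training loss:", 0.5 * np.mean((Z @ W - D) ** 2))
```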

Gradient Back Propagation

Gradient-based learning procedures have been used since the late 1950’s, but they were mostly limited to linear systems. The surprising usefulness of such simple gradient descent techniques for complex machine learning tasks was not widely realized until the following three events occurred.

  • Realization that, despite early warnings to the contrary [12], the presence of local minima in the loss function does not seem to be a major problem in practice
  • Popularization of a simple and efficient procedure to compute the gradient in a nonlinear system composed of several layers of processing, i.e., the back-propagation algorithm.
  • Demonstration that the back-propagation procedure applied to multilayer NN’s with sigmoidal units can solve complicated learning tasks

Continuous Back Propagation

For a NN of many layers, we can back-propagate gradients through the whole network. Writing the network as a cascade of modules X_n = F_n(W_n, X_n-1), if the partial derivative of E^p with respect to X_n is known, then the partial derivatives of E^p with respect to W_n and X_n-1 can be computed through a backward recurrence, written out below.
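
Written out explicitly (reconstructed from the paper's layered formulation, with X_n = F_n(W_n, X_{n-1}) the output of module n and E^p the loss on pattern p):

```latex
\frac{\partial E^p}{\partial W_n}
  = \frac{\partial F}{\partial W}(W_n, X_{n-1})\,
    \frac{\partial E^p}{\partial X_n},
\qquad
\frac{\partial E^p}{\partial X_{n-1}}
  = \frac{\partial F}{\partial X}(W_n, X_{n-1})\,
    \frac{\partial E^p}{\partial X_n}
```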

Fully Connected Layer vs Convolutional Layer

Problems with using only fully connected layers:

  • Using an FC first layer with a large number of hidden units generates a very large parameter set, which requires a lot of training data (a parameter-count comparison is sketched after this list).
  • FC layers have no built-in invariance with respect to translations or local distortions of the input.
  • Even if such a network could produce outputs that are invariant to these variations, it would end up with multiple units having similar weight patterns at various locations
  • this is the incentive for shared weights in conv layers

  • The topology of the input is ignored, since the inputs can be presented in any fixed order without affecting an FC layer
  • this is also an incentive for conv layers

  • Convolutional layers address all of these problems
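
To illustrate the parameter-count argument, here is a small sketch comparing a fully connected first layer with a convolutional one on a 32×32 input; the 100 hidden units in the FC case are an illustrative choice, while the 6 feature maps with 5×5 kernels match LeNet-5's first convolutional layer.

```python
# Parameter count for a fully connected first layer on a 32x32 image.
input_pixels = 32 * 32
hidden_units = 100                               # illustrative choice
fc_params = (input_pixels + 1) * hidden_units    # weights + biases
print("fully connected:", fc_params)             # 102,500 parameters

# Parameter count for LeNet-5's first convolutional layer:
# 6 feature maps, each with a shared 5x5 kernel and one bias.
feature_maps = 6
kernel_size = 5
conv_params = feature_maps * (kernel_size * kernel_size + 1)
print("convolutional:", conv_params)             # 156 parameters
```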

Core Ideas in Convolutional layer

Local receptive fields

  • The idea of connecting units to local receptive fields on the input goes back to the perceptron in the early 1960’s, and it was almost simultaneous with Hubel and Wiesel’s discovery of locally sensitive, orientation-selective neurons in the cat’s visual system.
  • With local receptive fields neurons can extract elementary visual features such as oriented edges, endpoints, corners (or similar features in other signals such as speech spectrograms). These features are then combined by the subsequent layers in order to detect higher order features

Shared weights

  • Units in a layer are organized in planes within which all the units share the same set of weights
  • A complete convolutional layer is composed of several feature maps (with different weight vectors), so that multiple features can be extracted at each location
  • As stated earlier, all the units in a feature map share the same set of 25 weights (a 5×5 receptive field) and the same bias, so they detect the same feature at all possible locations on the input (a minimal sketch of this shared-weight convolution follows this list)
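
A minimal NumPy sketch of one feature map: every output unit looks at a 5×5 local receptive field, and all units reuse the same 25 weights and single bias. The kernel values and the tanh squashing here are illustrative; only the wiring pattern is the point.

```python
import numpy as np

def feature_map(image, kernel, bias):
    """Compute one feature map: each output unit sees a local receptive
    field of the input, and all units share the same kernel and bias."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]          # local receptive field
            out[i, j] = np.sum(patch * kernel) + bias  # shared weights + bias
    return np.tanh(out)                                # squashing nonlinearity

image = np.random.default_rng(0).normal(size=(32, 32))
kernel = np.random.default_rng(1).normal(size=(5, 5))  # the 25 shared weights
fmap = feature_map(image, kernel, bias=0.1)
print(fmap.shape)   # (28, 28), as in LeNet-5's first convolutional layer
```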

Spatial/temporal subsampling

  • (Incentive) Not only is the precise position of each of those features irrelevant for identifying the pattern, it is potentially harmful because the positions are likely to vary for different instances of the character.
  • A simple way to reduce the precision with which the positions of distinctive features are encoded in a feature map is to reduce the spatial resolution of the feature map. This can be achieved with a so-called subsampling layer (a small pooling sketch follows this list)
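
In LeNet-5 the subsampling layer averages each non-overlapping 2×2 neighborhood, multiplies the average by a trainable coefficient, adds a trainable bias, and passes the result through a sigmoid. The sketch below reproduces that computation with fixed, illustrative values for the coefficient and bias.

```python
import numpy as np

def subsample(fmap, coeff=1.0, bias=0.0):
    """2x2 subsampling as in LeNet-5: average each non-overlapping 2x2
    neighborhood, scale by a coefficient, add a bias, squash with a sigmoid.
    In the actual network, coeff and bias are trainable per feature map."""
    H, W = fmap.shape
    pooled = fmap.reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))
    return 1.0 / (1.0 + np.exp(-(coeff * pooled + bias)))

fmap = np.random.default_rng(0).normal(size=(28, 28))   # one 28x28 feature map
s = subsample(fmap)
print(s.shape)   # (14, 14): spatial resolution halved
```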

Reference