AlexNet was the first convolutional neural network to achieve a state-of-the-art result in the ImageNet image classification competition (ILSVRC). Remarkably, its accuracy exceeded the previous record by a large margin (a 16.4% top-5 error rate compared to the previous 26.2%). Convolutional neural networks have huge advantages over manually engineered filters and classifiers, although challenges remain, such as choosing activation functions and overcoming overfitting. Below are my brief notes on AlexNet: the structure of the network and the techniques it uses.

### Layer Construction:

• Layer 0: Input image
* Size: 227 x 227 x 3
* Note that the network diagram in the original paper shows 224 x 224 x 3, which appears to be a typo.

• Layer 1: Convolution with 96 filters, size 11×11, stride 4, padding 0
* Size: 55 x 55 x 96
* (227 - 11)/4 + 1 = 55 is the spatial output size
* Depth is 96 because each filter produces one output channel, and there are 96 filters
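The output-size arithmetic used throughout these notes can be checked with a small helper (a minimal sketch; integer division assumes the kernel tiles the padded input evenly, as it does for every layer here):

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size - kernel + 2 * padding) // stride + 1

print(conv_out(227, 11, 4))  # 55, matching Layer 1 above
```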

• Layer 2: Max-Pooling with 3×3 filter, stride 2
* Size: 27 x 27 x 96
* (55 - 3)/2 + 1 = 27 is the spatial output size
* Depth stays 96 because pooling is applied independently to each channel

• Layer 3: Convolution with 256 filters, size 5×5, stride 1, padding 2
* Size: 27 x 27 x 256
* Padding of (5 - 1)/2 = 2 preserves the 27 x 27 spatial size
* Depth is 256 because there are 256 filters

• Layer 4: Max-Pooling with 3×3 filter, stride 2
* Size: 13 x 13 x 256
* (27 - 3)/2 + 1 = 13 is the spatial output size
* Depth stays 256 because pooling is applied independently to each channel

• Layer 5: Convolution with 384 filters, size 3×3, stride 1, padding 1
* Size: 13 x 13 x 384
* Padding of (3 - 1)/2 = 1 preserves the spatial size
* Depth is 384 because there are 384 filters

• Layer 6: Convolution with 384 filters, size 3×3, stride 1, padding 1
* Size: 13 x 13 x 384
* Padding of (3 - 1)/2 = 1 preserves the spatial size
* Depth is 384 because there are 384 filters

• Layer 7: Convolution with 256 filters, size 3×3, stride 1, padding 1
* Size: 13 x 13 x 256
* Padding of (3 - 1)/2 = 1 preserves the spatial size
* Depth is 256 because there are 256 filters

• Layer 8: Max-Pooling with 3×3 filter, stride 2
* Size: 6 x 6 x 256
* (13 - 3)/2 + 1 = 6 is the spatial output size
* Depth stays 256 because pooling is applied independently to each channel

• Layer 9: Fully Connected with 4096 neurons
* In this layer, each of the 6 x 6 x 256 = 9216 input activations is connected to each of the 4096 neurons, with the weights learned by back-propagation.

• Layer 10: Fully Connected with 4096 neurons
* Similar to layer #9

• Layer 11: Fully Connected with 1000 neurons
* This is the last layer; it has 1000 neurons because the ImageNet data has 1000 classes to predict.
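The spatial sizes listed above can be sanity-checked by chaining the same output-size formula through the whole stack (a minimal sketch in plain Python; the comments follow the layer numbering used in these notes):

```python
def out_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size - kernel + 2 * padding) // stride + 1

size = 227                       # Layer 0: input 227 x 227 x 3
size = out_size(size, 11, 4)     # Layer 1: conv 11x11 / stride 4 -> 55
size = out_size(size, 3, 2)      # Layer 2: max-pool 3x3 / stride 2 -> 27
size = out_size(size, 5, 1, 2)   # Layer 3: conv 5x5 / pad 2 -> 27
size = out_size(size, 3, 2)      # Layer 4: max-pool -> 13
size = out_size(size, 3, 1, 1)   # Layer 5: conv 3x3 / pad 1 -> 13
size = out_size(size, 3, 1, 1)   # Layer 6: conv 3x3 / pad 1 -> 13
size = out_size(size, 3, 1, 1)   # Layer 7: conv 3x3 / pad 1 -> 13
size = out_size(size, 3, 2)      # Layer 8: max-pool -> 6
print(size * size * 256)         # 9216 inputs to the first FC layer
```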


### Parameter size:

Layer 0:

• Memory: 227 x 227 x 3

Layer 1: (conv + ReLU + LRN)

• Memory: 55 x 55 x 96 x 3 (because of ReLU and LRN)
• Weights: 11 x 11 x 3 x 96

Layer 2: (pooling)

• Memory: 27 x 27 x 96

Layer 3: (conv + ReLU + LRN)

• Memory: 27 x 27 x 256 x 3 (because of ReLU and LRN)
• Weights: 5 x 5 x 96 x 256

Layer 4: (pooling)

• Memory: 13 x 13 x 256

Layer 5: (conv + ReLU)

• Memory: 13 x 13 x 384 x 2 (because of ReLU)
• Weights: 3 x 3 x 256 x 384

Layer 6: (conv + ReLU)

• Memory: 13 x 13 x 384 x 2 (because of ReLU)
• Weights: 3 x 3 x 384 x 384

Layer 7: (conv + ReLU)

• Memory: 13 x 13 x 256 x 2 (because of ReLU)
• Weights: 3 x 3 x 384 x 256

Layer 8: (pooling)

• Memory: 6 x 6 x 256

Layer 9: (FC + ReLU + Dropout)

• Memory: 4096 x 3 (because of ReLU and Dropout)
• Weights: 4096 x (6 x 6 x 256)

Layer 10: (FC + ReLU + Dropout)

• Memory: 4096 x 3 (because of ReLU and Dropout)
• Weights: 4096 x 4096

Layer 11: (FC)

• Memory: 1000
• Weights: 4096 x 1000

Total (label and softmax not included)

• Memory: 2.24 million
• Weights: 62.37 million
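The weight total can be verified by summing the shapes listed above (biases excluded, matching the count in these notes):

```python
# (kernel_h, kernel_w, in_channels, out_channels) for the five conv layers
conv = [
    (11, 11, 3, 96),
    (5, 5, 96, 256),
    (3, 3, 256, 384),
    (3, 3, 384, 384),
    (3, 3, 384, 256),
]
# (inputs, outputs) for the three fully connected layers
fc = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]

weights = sum(kh * kw * cin * cout for kh, kw, cin, cout in conv)
weights += sum(n_in * n_out for n_in, n_out in fc)
print(weights)  # 62,367,776, i.e. about 62.37 million
```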


### Analysis

ReLU activation

• ReLUs have the desirable property that they do not require input normalization to prevent them from saturating
• This makes the network converge faster than with the traditional tanh activation function
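The saturation difference can be shown numerically: the tanh gradient vanishes for large inputs, while the ReLU gradient stays at 1 for any positive input (a toy sketch, not how gradients are computed in a real framework):

```python
import math

def tanh_grad(x):
    """d/dx tanh(x) = 1 - tanh(x)^2; saturates toward 0 for large |x|."""
    return 1 - math.tanh(x) ** 2

def relu_grad(x):
    """d/dx max(0, x); constant 1 for any positive input."""
    return 1.0 if x > 0 else 0.0

for x in (1, 5, 10):
    print(x, tanh_grad(x), relu_grad(x))
```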

LRN

• Unlike tanh or sigmoid, ReLU outputs are unbounded, so a local response normalization (LRN) layer is added to normalize the activations
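A minimal sketch of LRN across channels at a single spatial position, using the hyperparameters reported in the AlexNet paper (k = 2, n = 5, alpha = 1e-4, beta = 0.75):

```python
def lrn(acts, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize each channel's activation by the sum of squares of its
    n neighboring channels: b_i = a_i / (k + alpha * sum a_j^2) ** beta."""
    N = len(acts)
    out = []
    for i, a in enumerate(acts):
        lo, hi = max(0, i - n // 2), min(N - 1, i + n // 2)
        s = sum(acts[j] ** 2 for j in range(lo, hi + 1))
        out.append(a / (k + alpha * s) ** beta)
    return out

# A large activation is damped relative to its neighborhood
print(lrn([1.0, 50.0, 1.0, 0.0, 2.0]))
```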

#### Data Augmentation (more data)

• Translations and horizontal reflections of the training images
• Altering the intensities of the RGB channels (a PCA-based color perturbation)
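The first augmentation can be sketched on a toy nested-list "image" (the real pipeline extracts 224×224 patches from 256×256 images; the sizes here are illustrative only):

```python
import random

def random_crop_and_flip(img, crop):
    """img: H x W nested list; return a random crop, mirrored half the time."""
    h, w = len(img), len(img[0])
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    patch = [row[left:left + crop] for row in img[top:top + crop]]
    if random.random() < 0.5:              # horizontal reflection
        patch = [row[::-1] for row in patch]
    return patch

img = [[r * 10 + c for c in range(6)] for r in range(6)]
print(random_crop_and_flip(img, 4))
```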

#### Dropout (more robust network structure)

Incentive:

• Achieve the effect of ‘combining the predictions of many different models’, which reduces test error
• But doing this literally would require training many separate neural networks, or training one network in many different ways

Method:

• Randomly set the output of each hidden neuron to zero with probability 0.5
• These dropped neurons do not participate in the forward pass or in back-propagation

Effect:

• Each forward pass thus samples a different architecture, but all of these architectures share the same weights
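The method above can be sketched as "inverted" dropout, a common modern variant: instead of halving the outputs at test time as the original paper does, surviving activations are rescaled during training so the test-time network needs no change:

```python
import random

def dropout(acts, p=0.5, train=True):
    """Zero each activation with probability p during training and rescale
    the survivors by 1/(1-p) so the expected activation is unchanged."""
    if not train:
        return list(acts)
    return [0.0 if random.random() < p else a / (1 - p) for a in acts]

random.seed(0)
print(dropout([1.0, 2.0, 3.0, 4.0]))   # roughly half the values zeroed
```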