IOGS - ATSI
A stack of linear layers with activation functions (e.g., sigmoids)
Optimization with gradient descent.
Lots of weights! (at least one per pixel!)
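A minimal sketch of such a stack, assuming PyTorch; the sizes (784 = 28x28 input pixels, 128 hidden units, 10 classes) and the learning rate are illustrative, not from the notes:

```python
import torch
import torch.nn as nn

# Stack of linear layers with sigmoid activations
model = nn.Sequential(
    nn.Linear(784, 128),  # at least one weight per input pixel, per hidden neuron
    nn.Sigmoid(),
    nn.Linear(128, 10),
)
print(sum(p.numel() for p in model.parameters()))  # already ~100k weights

# One gradient-descent step on a random batch (illustrative data)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, target = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = nn.CrossEntropyLoss()(model(x), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```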
Is it interesting to look at relations across the whole image?
Look at small neighborhoods (where the objects are)
Create neurons that take a patch as input
A translation of the object must lead to the same behaviour of the neurons
Use the same neuron (i.e. all the neurons of the layer share weights)
Let $(i, j)$ be the coordinates in the input map and $k \times k$ the size of the patch (size of the kernel, usually small, e.g. $3 \times 3$).
The output of the shared neuron at $(i, j)$ is $y_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} w_{m,n}\, x_{i+m,\, j+n} + b$
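A naive NumPy sketch of this shared-weight neuron, implementing the formula above; the names (`conv2d`, `X`, `W`, `b`) are illustrative:

```python
import numpy as np

def conv2d(X, W, b=0.0):
    """Apply the same k x k kernel W at every position of the input map X."""
    k = W.shape[0]
    H, L = X.shape
    Y = np.zeros((H - k + 1, L - k + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # same shared weights for every patch
            Y[i, j] = np.sum(W * X[i:i + k, j:j + k]) + b
    return Y
```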
Backward weight update
Let $Y$ be the output map and $\delta = \partial L / \partial Y$ be the gradient coming back:
$\frac{\partial L}{\partial w_{m,n}} = \sum_{i,j} \delta_{i,j}\, x_{i+m,\, j+n}$
Finally, the update rule (learning rate $\eta$):
$w_{m,n} \leftarrow w_{m,n} - \eta\, \frac{\partial L}{\partial w_{m,n}}$
The same with term-to-term multiplication: for each shift $(m, n)$, $\frac{\partial L}{\partial w_{m,n}} = \sum \big( \delta \odot X_{(m,n)} \big)$, where $X_{(m,n)}$ is the input map shifted by $(m, n)$.
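A NumPy sketch of this backward pass, with the notation above (`delta` is the incoming gradient map $\delta$, `eta` the learning rate):

```python
import numpy as np

def kernel_grad(X, delta, k):
    """dL/dW: sum over output positions of delta[i, j] times the input patch."""
    dW = np.zeros((k, k))
    for i in range(delta.shape[0]):
        for j in range(delta.shape[1]):
            dW += delta[i, j] * X[i:i + k, j:j + k]  # term-to-term product
    return dW

# Update rule: W <- W - eta * dL/dW
# W -= eta * kernel_grad(X, delta, W.shape[0])
```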
First layer of AlexNet.
With the previous convolution, the output dimension is the same as the input dimension.
For classification: only one label, so dimension reduction is needed (pooling).
Max pooling: the gradient is transmitted to the origin of the max signal, and is zero elsewhere.
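A NumPy sketch of max pooling with this gradient routing, assuming the usual $2 \times 2$ windows and even map dimensions:

```python
import numpy as np

def maxpool2x2(X):
    H, W = X.shape  # assumed even
    return X.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def maxpool2x2_backward(X, dY):
    """Route each gradient to the origin of the max signal, zero elsewhere."""
    dX = np.zeros_like(X)
    for i in range(dY.shape[0]):
        for j in range(dY.shape[1]):
            window = X[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            m, n = np.unravel_index(np.argmax(window), (2, 2))
            dX[2 * i + m, 2 * j + n] = dY[i, j]
    return dX
```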
Very good results on digit recognition!
A smoother gradient converges faster.
Changes in the signal dynamics make the model more difficult to optimize: exploding or vanishing gradients.
Objective: control the signal distribution (batch normalization): $y = \gamma\, \frac{x - \mu}{\sigma} + \beta$
$\gamma$ and $\beta$ are learnt; $\mu$ and $\sigma$ are computed (mean and standard deviation of the batch).
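A NumPy sketch of this normalization over a batch; the small `eps` that avoids division by zero is an assumed implementation detail:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)    # computed on the batch
    sigma = x.std(axis=0)  # computed on the batch
    x_hat = (x - mu) / (sigma + eps)
    return gamma * x_hat + beta  # gamma and beta are learnt
```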
Learning is faster (fewer iterations) but each iteration is slower (statistics computation).
Weights have a great influence on convergence speed.
They are randomly initialized.
Objective: conservation of the signal properties (variance) from layer to layer.
Let $X_i$ be the inputs, $W_i$ the weights and $Y = \sum_{i=1}^{n} W_i X_i$ the output.
Assuming independent, zero-mean variables: finally $Var(Y) = n\, Var(X_i)\, Var(W_i)$, and we choose $Var(W_i) = \frac{1}{n}$ so that $Var(Y) = Var(X_i)$.
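A quick NumPy check of this choice, with illustrative sizes:

```python
import numpy as np

n, m = 1000, 1000
X = np.random.randn(n)                  # Var(X_i) = 1
W = np.random.randn(m, n) / np.sqrt(n)  # Var(W_i) = 1/n
Y = W @ X
print(X.var(), Y.var())  # both close to 1: the signal variance is conserved
```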
Update rule (learning rate $\eta$): $w \leftarrow w - \eta\, \nabla_w L$
Same idea as mini-batch: smooth the gradient in the right direction.
Use the previous gradient to weight the direction of the new one:
$v \leftarrow \mu v - \eta\, \nabla_w L$, then $w \leftarrow w + v$, where $\mu$ is the momentum.
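A Python sketch of this update; the default values of `eta` and `mu` are illustrative:

```python
def momentum_step(w, grad, v, eta=0.01, mu=0.9):
    """v accumulates past gradients; w moves along the smoothed direction."""
    v = mu * v - eta * grad
    w = w + v
    return w, v
```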
Training data must be representative of the problem.
- Compute the mean $\mu$ and standard deviation $\sigma$ on the train set.
- Normalize the input (train and test): $x \leftarrow \frac{x - \mu}{\sigma}$
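A NumPy sketch with illustrative shapes; the test set reuses the train statistics:

```python
import numpy as np

X_train = np.random.rand(1000, 784)  # illustrative data
X_test = np.random.rand(200, 784)

mu = X_train.mean(axis=0)    # statistics from the train set only
sigma = X_train.std(axis=0)
X_train = (X_train - mu) / sigma
X_test = (X_test - mu) / sigma  # same statistics applied to the test set
```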
Random variations of the input parameters (images: brightness, contrast, ...)
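A NumPy sketch of such random variations on an image with values in $[0, 1]$; the ranges are illustrative:

```python
import numpy as np

def augment(img, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    contrast = rng.uniform(0.8, 1.2)     # random contrast factor
    brightness = rng.uniform(-0.1, 0.1)  # random brightness shift
    return np.clip(contrast * img + brightness, 0.0, 1.0)
```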