A stack of linear layers with activation functions (e.g., sigmoids)
Optimization with gradient descent.
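A minimal sketch of such a stack, assuming a toy two-layer network with sigmoid activations trained on XOR by plain gradient descent (all names and hyperparameters are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy data: learn XOR with a 2-4-1 network.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
lr = 1.0

for step in range(5000):
    # Forward pass: stack of linear layers followed by sigmoids.
    H = sigmoid(X @ W1 + b1)
    Y = sigmoid(H @ W2 + b2)
    # Backward pass (squared-error loss), then gradient-descent update.
    dY = (Y - T) * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(0)
```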
Lots of weights! (at least one per pixel!)
Is it useful to look at relations across the whole image?
Look at small neighborhoods (where the objects are)
Create neurons that take a small patch as input
Translation of the object must lead to the same behaviour of the neurons
Use the same neuron at every position (i.e. all the neurons of the layer share their weights)
Let $X$ be the input image, $W$ the shared $k \times k$ kernel and $b$ the bias.
Then: $Y_{i,j} = \sum_{u=0}^{k-1} \sum_{v=0}^{k-1} W_{u,v}\, X_{i+u,\, j+v} + b$
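A sketch of this shared-weight neuron, assuming the valid convolution/correlation written above; the explicit loops are for clarity only:

```python
import numpy as np

def conv2d_forward(X, W, b):
    """Valid 2D convolution/correlation: Y[i, j] = sum_{u,v} W[u, v] * X[i+u, j+v] + b."""
    k = W.shape[0]
    H, Wd = X.shape
    Y = np.zeros((H - k + 1, Wd - k + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            # The same kernel W is applied at every position (weight sharing).
            Y[i, j] = np.sum(W * X[i:i + k, j:j + k]) + b
    return Y

X = np.random.randn(8, 8)   # input image
W = np.random.randn(3, 3)   # shared 3x3 kernel
Y = conv2d_forward(X, W, b=0.0)
```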
Backward weight update
Let $\delta_{i,j} = \frac{\partial L}{\partial Y_{i,j}}$ be the gradient coming from the next layer, so that $\frac{\partial L}{\partial W_{u,v}} = \sum_{i,j} \delta_{i,j}\, X_{i+u,\, j+v}$.
Finally, the update rule: $W_{u,v} \leftarrow W_{u,v} - \eta \sum_{i,j} \delta_{i,j}\, X_{i+u,\, j+v}$
The same can be written as a term-by-term (element-wise) multiplication of the output gradient with each input patch, followed by a sum.
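A sketch of this weight-gradient computation, under the same assumptions as the forward sketch: the gradient with respect to the shared kernel is a term-by-term multiply-and-sum (a correlation) between the input and the output gradient:

```python
import numpy as np

def conv2d_backward_weights(X, dY, k):
    """dL/dW[u, v] = sum_{i,j} dY[i, j] * X[i+u, j+v]."""
    dW = np.zeros((k, k))
    for u in range(k):
        for v in range(k):
            # Term-by-term product of the output gradient with a shifted view of X.
            dW[u, v] = np.sum(dY * X[u:u + dY.shape[0], v:v + dY.shape[1]])
    return dW

X = np.random.randn(8, 8)
dY = np.random.randn(6, 6)          # gradient received from the next layer
W = np.random.randn(3, 3)
eta = 0.01
W -= eta * conv2d_backward_weights(X, dY, k=3)   # update rule
```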
Gabor filters.
First layer of AlexNet.
With the previous convolution, the output dimension is almost the same as the input dimension.
For classification there is only one label, so the dimension must be reduced.
Keep the maximum signal in each neighborhood (max pooling)
The gradient is transmitted to the origin of the maximum signal, and is zero elsewhere.
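A sketch of 2x2 max pooling with this gradient routing (shapes and names are illustrative):

```python
import numpy as np

def maxpool2x2_forward(X):
    H, W = X.shape
    Y = np.zeros((H // 2, W // 2))
    argmax = {}
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            patch = X[i:i + 2, j:j + 2]
            u, v = np.unravel_index(np.argmax(patch), patch.shape)
            Y[i // 2, j // 2] = patch[u, v]            # keep the maximum signal
            argmax[(i // 2, j // 2)] = (i + u, j + v)  # remember where it came from
    return Y, argmax

def maxpool2x2_backward(dY, argmax, in_shape):
    dX = np.zeros(in_shape)
    for (i, j), (u, v) in argmax.items():
        dX[u, v] = dY[i, j]          # gradient goes to the max origin, zero elsewhere
    return dX

X = np.random.randn(4, 4)
Y, argmax = maxpool2x2_forward(X)
dX = maxpool2x2_backward(np.ones_like(Y), argmax, X.shape)
```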
Very good results on digit recognition!
A smoother gradient converges faster.
Changes in the signal dynamics make the model harder to optimize: exploding or vanishing gradients.
Objective: control the signal distribution (batch normalization).
Learning is faster (fewer iterations) but each iteration is slower (the batch statistics must be computed).
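A sketch of the training-time normalization, assuming the standard batch-normalization formulation (the learnable $\gamma$, $\beta$ and the $\epsilon$ constant are part of that assumption):

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-5):
    mu = X.mean(axis=0)                       # batch statistics (the "slower" part)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)     # controlled signal distribution
    return gamma * X_hat + beta               # learnable rescale and shift

X = np.random.randn(32, 16) * 5.0 + 3.0       # mini-batch with shifted/scaled signal
out = batch_norm_train(X, gamma=np.ones(16), beta=np.zeros(16))
```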
The initial weights have a great influence on the convergence speed.
They are randomly initialized.
Objective: conservation of the signal properties (variance) across layers.
Finally, $Var(Y) = n\, Var(X_i)\, Var(W_i)$, and we choose $Var(W_i) = \frac{1}{n}$ so that $Var(Y) = Var(X_i)$ (Xavier initialization).
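A sketch of the resulting initialization (Xavier/Glorot-style), checking that the variance is roughly conserved through one linear layer:

```python
import numpy as np

def xavier_init(n_in, n_out, rng=np.random.default_rng(0)):
    # Weights drawn with variance 1/n_in so that Var(Y) stays close to Var(X).
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

n = 256
W = xavier_init(n, n)
X = np.random.randn(1000, n)                 # unit-variance input signal
Y = X @ W
print(X.var(), Y.var())                      # the two variances stay comparable
```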
Update rule (with momentum): $v \leftarrow \mu\, v - \eta\, \nabla_w L, \qquad w \leftarrow w + v$
Same idea as the mini-batch: smooth the gradient in the good direction.
Use the previous gradient to weight the direction of the new gradient.
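A sketch of this momentum update, assuming the classical form written above, with illustrative hyperparameters and a toy quadratic loss:

```python
import numpy as np

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    v = mu * v - lr * grad      # blend the previous direction with the new gradient
    w = w + v
    return w, v

w = np.zeros(10)
v = np.zeros(10)
for _ in range(100):
    grad = 2 * (w - 1.0)        # gradient of the toy loss ||w - 1||^2
    w, v = momentum_step(w, v, grad)
```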
Training data must be representative of the problem.
- Compute the mean
- Normalize the input
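A sketch of this preprocessing; dividing by the standard deviation is an assumption on top of the two steps listed above:

```python
import numpy as np

X_train = np.random.rand(1000, 784) * 255.0   # e.g. raw pixel values
mean = X_train.mean(axis=0)                   # statistics computed on the training set only
std = X_train.std(axis=0) + 1e-8

def normalize(X):
    return (X - mean) / std                   # the same statistics are reused at test time

X_train_n = normalize(X_train)
```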
Data augmentation: random variations of the input parameters (for images: brightness, contrast, \dots)
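A sketch of such random variations for images, with illustrative ranges for brightness and contrast:

```python
import numpy as np

def augment(img, rng=np.random.default_rng()):
    brightness = rng.uniform(-0.2, 0.2)        # additive brightness shift
    contrast = rng.uniform(0.8, 1.2)           # multiplicative contrast change
    return np.clip(contrast * img + brightness, 0.0, 1.0)

img = np.random.rand(32, 32)                   # image with values in [0, 1]
img_aug = augment(img)
```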