Simple problem:
Predict if Bob will like a movie given Alice's grade
Hypothesis:
A simple threshold
Simple problem:
Predict if Bob will like a movie given Alice's and Carol's grades
Hypothesis:
An affine function
More complicated problem:
Predict if Bob will like a movie given a large user database
Hypothesis:
An affine function
More complicated problem:
Predict if Bob will like a movie given a large user database
Hypothesis:
A non-affine function?
A simple model of the neuron: activation level is the weighted sum of the inputs
Simple problem:
Predict if Bob will like a movie given Alice's and Carol's grades
Hypothesis:
An affine function
It can be modeled with a single neuron: the decision boundary $\{x : w^\top x + b = 0\}$ is a hyperplane of the input space, and the sign of the activation $w^\top x + b$ gives the prediction.
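A minimal sketch of such a neuron in Python (the weights, bias, and grades below are hypothetical values chosen only for illustration):

import numpy as np

def neuron_predict(x, w, b):
    """Single artificial neuron: sign of the weighted sum of the inputs."""
    activation = np.dot(w, x) + b          # weighted sum plus bias
    return 1 if activation >= 0 else -1    # +1: Bob likes the movie, -1: he does not

# hypothetical weights and bias; x = [Alice's grade, Carol's grade]
w = np.array([0.6, 0.4])
b = -2.5
print(neuron_predict(np.array([4.0, 3.0]), w, b))  # -> 1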
The artificial neuron
Previous course on SVM: the SVM was originally formulated using neurons.
Multilayer feed-forward networks with as few as one hidden layer are universal approximators
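As a sketch of such a one-hidden-layer network (the width of 16 hidden units, the tanh nonlinearity, and the random weights are arbitrary choices, not prescribed by these notes):

import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer feed-forward network: affine map, nonlinearity, affine map."""
    h = np.tanh(W1 @ x + b1)   # hidden layer (tanh is an assumed choice of nonlinearity)
    return W2 @ h + b2         # output layer

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)   # 2 inputs -> 16 hidden units
W2, b2 = rng.normal(size=(1, 16)), np.zeros(1)    # 16 hidden units -> 1 output
print(mlp_forward(np.array([4.0, 3.0]), W1, b1, W2, b2))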
Find parameters $w$ such that $f_w(x) = y$ for every movie $(x, y)$ in the database.
Objective
Reach the bottom of the valley
Minimizing an objective function
For every smooth function $f$, the negative gradient $-\nabla f(x)$ points in the direction of steepest descent,
i.e., following the opposite direction of the gradient leads to a local minimum of the function.
An iterative algorithm (gradient descent): $x_{t+1} = x_t - \eta \nabla f(x_t)$, where $\eta$ is the learning rate.
Too high a learning rate: the iterates overshoot and may diverge.
Too low a learning rate: convergence is very slow.
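A minimal gradient descent sketch (the quadratic objective and the learning rate of 0.1 are illustrative choices):

import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, n_steps=100):
    """Iteratively follow the opposite direction of the gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - lr * grad_f(x)   # x_{t+1} = x_t - lr * grad f(x_t)
    return x

# example objective: f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(x_min)  # close to 3, the bottom of the "valley"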
Simple problem:
Predict if Bob will like a movie given Alice's and Carol's grades
Hypothesis:
An affine function
The sign function is: $\operatorname{sign}(a) = +1$ if $a \geq 0$, $-1$ otherwise.
We need a function to compare predictions and ground truth.
A loss function $\ell(\hat{y}, y)$ such that: it is low when the prediction $\hat{y}$ agrees with the ground truth $y$, higher otherwise, and it is differentiable (so that gradient descent applies).
Possibilities include, e.g., the squared error or the cross-entropy discussed later in these notes.
Iteratively compute the output of the network (forward pass).
Iteratively compute the derivatives starting from the output (backward pass).
Update the weights according to the learning rate.
Loss function and derivatives for the single neuron model:
Loss function and derivatives for a two-neuron model:
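The exact formulas are left implicit here; assuming a squared loss and a sigmoid activation $\sigma$ (illustrative choices), the derivatives work out as:
Single neuron: $\hat{y} = \sigma(w^\top x + b)$, loss $L = \tfrac{1}{2}(\hat{y} - y)^2$.
Derivatives (chain rule): $\frac{\partial L}{\partial w} = (\hat{y} - y)\,\sigma'(w^\top x + b)\,x$ and $\frac{\partial L}{\partial b} = (\hat{y} - y)\,\sigma'(w^\top x + b)$.
Two neurons in sequence: $h = \sigma(w_1 x + b_1)$, $\hat{y} = \sigma(w_2 h + b_2)$;
then $\frac{\partial L}{\partial w_1} = (\hat{y} - y)\,\sigma'(w_2 h + b_2)\,w_2\,\sigma'(w_1 x + b_1)\,x$, where each factor comes from one layer: this is exactly what the backward pass accumulates.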
We do not want to explicitly expand the whole loss function and differentiate it by hand.
In practice, we make intensive use of the chain rule: $(g \circ f)'(x) = g'(f(x)) \, f'(x)$,
or, for three functions: $(h \circ g \circ f)'(x) = h'(g(f(x))) \, g'(f(x)) \, f'(x)$.
import numpy as np

class Module:
    """Example module: a fully connected (linear) layer."""
    def __init__(self, in_size, out_size):
        self.weights = np.random.randn(out_size, in_size) * 0.01
    def forward(self, x):
        y = self.weights @ x           # compute output
        self.ctx = x                   # save the input for backward (saves computation time)
        return y
    def backward(self, grad_output):
        # compute gradient w.r.t. parameters
        self.grad_weights = np.outer(grad_output, self.ctx)
        # compute gradient w.r.t. input, returned for use in the previous layer
        grad_input = self.weights.T @ grad_output
        return grad_input
    def update_weights(self, lr):
        # apply gradient descent
        self.weights = self.weights - lr * self.grad_weights
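A hypothetical usage sketch of the Module class above (the input, target, squared loss, and learning rate are made up for illustration):

import numpy as np

layer = Module(in_size=2, out_size=1)
x, y = np.array([4.0, 3.0]), np.array([1.0])   # one (input, target) pair

for step in range(100):
    y_hat = layer.forward(x)                   # forward pass
    grad_output = 2 * (y_hat - y)              # derivative of the squared loss w.r.t. the output
    layer.backward(grad_output)                # backward pass
    layer.update_weights(lr=0.01)              # gradient descent step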
Find parameters $w$ such that $f_w(x) = y$ for all movies $(x, y)$.
No access to the whole set of movies, only a training subset $\{(x_i, y_i)\}_{i=1}^{N}$, with $N$ the number of training samples.
Minimizing over the training set: $\min_w \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_w(x_i), y_i\big)$
Approximate the sum over the training set by picking only one sample at each iteration (stochastic gradient descent).
Is it the same as gradient descent? Not exactly, and convergence is very slow.
Instead, average the gradient over batches.
A batch = a random subset of the training set.
(All neural network libraries handle batches; a mini-batch SGD sketch follows the code below.)
NumPy style, looping over the samples:
batch  # size (B, in_size)
w      # size (out_size, in_size)
b      # size (out_size)
output = []
for i in range(batch.shape[0]):
    temp = w @ batch[i] + b
    output.append(temp)
output = np.stack(output, axis=0)
output  # size (B, out_size)
With batch operations
batch # size (B, in_size)
w  # size (in_size, out_size), note: transposed compared to the loop version
b  # size (out_size)
output = batch @ w + b
output # size (B, out_size)
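A minimal mini-batch SGD sketch (the toy dataset, batch size of 32, squared loss, and learning rate are placeholder choices, not from these notes):

import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 2)), rng.normal(size=(1000, 1))   # toy training set
w, b = np.zeros((2, 1)), np.zeros(1)
batch_size, lr = 32, 0.01

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # random subset of the training set
    xb, yb = X[idx], Y[idx]
    y_hat = xb @ w + b                                         # batch forward pass
    grad = 2 * (y_hat - yb) / batch_size                       # averaged squared-loss gradient
    w -= lr * (xb.T @ grad)                                    # gradient w.r.t. w, averaged over the batch
    b -= lr * grad.sum(axis=0)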
Information estimates the number of bits required to encode/transmit an event:
Information: $I(x) = -\log_2 p(x)$
Entropy is the expected number of bits to encode/transmit a random event:
Entropy: $H(p) = -\sum_x p(x) \log_2 p(x)$
Cross-entropy estimates the number of bits needed to transmit events drawn from a distribution $p$ using a code optimized for another distribution $q$.
For one sample: $-\log_2 q(x)$
For a dataset: $H(p, q) \approx -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)$
(averaged for insensitivity to the dataset size)
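A small numerical check of these formulas (the two distributions below are arbitrary examples):

import numpy as np

def entropy(p):
    """Expected number of bits to encode events drawn from p."""
    return -np.sum(p * np.log2(p))

def cross_entropy(p, q):
    """Expected number of bits when events from p are encoded with a code optimised for q."""
    return -np.sum(p * np.log2(q))

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(entropy(p))           # 1.5 bits
print(cross_entropy(p, q))  # ~1.585 bits, always >= entropy(p)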
For classification, let $p$ be the empirical label distribution (probability 1 on the true class) and $q$ the distribution predicted by the network.
Then the cross-entropy for one sample reduces to $-\log q_y$, the negative log-probability assigned to the true class $y$.
Let $y \in \{0, 1\}$ be the ground truth and let $\hat{y} \in [0, 1]$ be the predicted probability (e.g., the output of a sigmoid); the per-sample loss is then $-\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$.
Can we use a single output for multi-class classification?
Predict a vector with one value per class: $\hat{y} \in \mathbb{R}^{C}$.
The highest value gives the selected class: $\hat{c} = \arg\max_c \hat{y}_c$.
What loss can we use?
Seeing the output as a probability distribution allows us to use the cross-entropy.
Let the softmax turn the output vector $z$ into a probability distribution: $s(z)_c = \frac{e^{z_c}}{\sum_k e^{z_k}}$.
What do we gain? The values are positive and sum to one, so they can be read as class probabilities.
Good properties associated with cross-entropy: combined with the softmax, the loss for a sample of true class $y$ is simply $L = -\log s(z)_y$,
and its derivative with respect to the outputs is $\frac{\partial L}{\partial z_c} = s(z)_c - \mathbb{1}[c = y]$.
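A short numerical sketch of the softmax cross-entropy and its gradient (the logits and true class below are arbitrary):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract max for numerical stability
    return e / e.sum()

def cross_entropy_loss(z, true_class):
    return -np.log(softmax(z)[true_class])

z, true_class = np.array([2.0, -1.0, 0.5]), 0
probs = softmax(z)
grad = probs.copy()
grad[true_class] -= 1.0            # gradient w.r.t. the logits: softmax(z) - one_hot(true_class)
print(cross_entropy_loss(z, true_class), grad)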
Neural Network notes