Gradient descent
26 Dec 2017
Demonstration of the normal (batch) gradient descent formulas
$y$ is the training output label (always 0 or 1)
$m$ is the size of the training dataset of inputs $x$
$w$ are the parameters of logistic regression
The first assumption is that we can approximate the likelihood with the sigmoid function:
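In its standard form, the hypothesis (the approximated probability that $y = 1$ given the linear combination $w^T x$ of the inputs) is:

$$h_w(x) = \sigma(w^T x) = \dfrac{1}{1 + e^{-w^T x}}$$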
The cost function (the negated log-likelihood) is the error at every iteration (the cross-entropy error) and depends on the likelihood.
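Written out, the standard cross-entropy cost for the hypothesis $h_w$ above is:

$$J(w) = -\dfrac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_w(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_w(x^{(i)})\right) \right]$$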
This formula is derived from the likelihood of the $m$ training examples, computed as:
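Assuming independent examples, that likelihood is:

$$L(w) = \prod_{i=1}^{m} h_w(x^{(i)})^{\,y^{(i)}} \left(1 - h_w(x^{(i)})\right)^{1 - y^{(i)}}$$

Taking the negative logarithm and averaging over the $m$ examples gives the cost $J(w)$ above.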
Note: we assume that the output is a binary class ($k = 2$), so the two class probabilities are redundant and complementary (they sum to 1).
Returning to the cost ($J(w)$):
The cost is convex, so there is a global minimum that can be reached using gradient descent.
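The update rule, in its standard form for this cost (applied to every parameter $w_j$, with $\alpha$ as the learning rate), is:

$$w_j := w_j - \alpha \dfrac{\partial J(w)}{\partial w_j} = w_j - \dfrac{\alpha}{m} \sum_{i=1}^{m} \left( h_w(x^{(i)}) - y^{(i)} \right) x_j^{(i)}$$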
There are several types of gradient descent implementations:
- Batch gradient descent: each iteration (new coefficients) is computed from the sum over all training examples.
- Stochastic gradient descent: one iteration per training example; it converges faster than batch gradient descent for large training datasets.
- Mini-batch gradient descent: one iteration per group of $b$ examples; in some cases it can converge faster than batch or stochastic.
- Stochastic gradient descent with momentum: each gradient step adds a momentum term carried over from the previous gradient:
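In a common formulation, with $\gamma$ as the momentum coefficient and $v$ the accumulated velocity:

$$v := \gamma\, v + \alpha\, \nabla_w J(w), \qquad w := w - v$$

As an illustration, a minimal Python sketch of stochastic gradient descent with momentum for logistic regression; the function name and hyperparameter values are only assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_momentum(X, y, alpha=0.1, gamma=0.9, epochs=10):
    """Stochastic gradient descent with momentum for logistic regression.

    X: (m, n) feature matrix, y: (m,) labels in {0, 1}.
    alpha: learning rate, gamma: momentum coefficient (assumed values).
    """
    m, n = X.shape
    w = np.zeros(n)  # parameters
    v = np.zeros(n)  # velocity (accumulated momentum)
    for _ in range(epochs):
        for i in np.random.permutation(m):            # visit examples in random order
            grad = (sigmoid(X[i] @ w) - y[i]) * X[i]  # gradient for a single example
            v = gamma * v + alpha * grad              # momentum update
            w = w - v                                 # parameter update
    return w
```

With $\gamma = 0$ this reduces to plain stochastic gradient descent; summing the per-example gradients over the whole dataset before each update would give the batch variant.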
Important implementation concepts:
- Debug the results to adjust the learning rate
- Optionally set a dynamic learning rate such as $\alpha = \dfrac{const_1}{iterationNumber + const_2}$, so the learning rate decreases as it approaches the minimum.
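A small sketch of such a decaying schedule (the constant values are chosen arbitrarily for illustration):

```python
def learning_rate(iteration, const1=1.0, const2=10.0):
    """Learning rate that decays as the iteration number grows."""
    return const1 / (iteration + const2)
```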
A special case of stochastic gradient descent is online learning. Learning is done continuously from a stream that generates training examples. Each example is used for training once and then discarded (not stored).
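A sketch of that loop, assuming a generator `example_stream` that yields `(x, y)` pairs (a hypothetical name) and reusing the `sigmoid` helper from the sketch above:

```python
def online_learning(example_stream, w, alpha=0.05):
    """Train continuously from a stream; each example is used once and not stored."""
    for x, y in example_stream:
        grad = (sigmoid(x @ w) - y) * x  # gradient for the current example
        w = w - alpha * grad             # immediate update
    return w
```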
Source: Coursera Machine Learning course
Glossary
- MLE: Maximum Likelihood Estimation
- sigmoid: function that maps $(-\infty,\infty)$ to $(0,1)$
- Epoch: one full pass of stochastic gradient descent over the entire training dataset